Authors: "Anish Bhandari, Will Jones, Nicholas Sager"

In [33]:
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
# added the line just for testing - Anish Bhandari
# Required Libraries
library(tidyverse)
library(knitr)
library(kableExtra)
library(ggthemes)
library(caret)
library(janitor)
library(doParallel)

#library(e1071)
#library(class)

Introduction

Modeling is a central tool for understanding and predicting relationships among variables. In this project, we work through data processing, exploratory data analysis, and model construction. Our primary objective is to build robust models that offer useful insight and predict accurately. The first model prioritizes interpretability so that its results can be explained. We then develop two additional models that emphasize predictive accuracy.

The video presentation for this project can be found at: https://youtu.be/_Rdo4PIEZZI

Data Description

Kaggle is used by data scientists and machine learning engineers to discover data, build models, and compete in challenges. One of the most popular competitions on Kaggle is 'House Prices - Advanced Regression Techniques'. As of 6/6/2023, this competition had close to 28K entries.

The Ames Housing dataset was compiled by Dean De Cock and is available at the link below. There are two files, train.csv and test.csv. Both datasets have 79 explanatory variables. SalePrice is the response variable; it is present in train and absent in test. The train dataset has 1460 unique rows and the test dataset has 1459 unique rows. For our modeling exercise, we fit models only on the train dataset. We include the test dataset so that predictive performance can be assessed with Kaggle prediction scores.

Dataset Link: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

Read the Data

In [34]:
# train <- read.csv("https://raw.githubusercontent.com/NickSager/DS_6372_Ames2/master/Data/train.csv")
# test<- read.csv("https://raw.githubusercontent.com/NickSager/DS_6372_Ames2/master/Data/test.csv")

train <- read.csv("Data/train.csv")
test <- read.csv("Data/test.csv")

# Merge the data frames and add a column indicating whether they come from the train or test set
train$train <- 1
test$SalePrice <- NA
test$train <- 0
ames <- rbind(train, test)

# Verify data frame
head(ames)
str(ames)
summary(ames)
A data.frame: 6 × 82
     Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ⋯ PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice train
  <int>      <int>    <chr>       <int>   <int>  <chr> <chr>    <chr>       <chr>     <chr> ⋯  <chr> <chr>       <chr>   <int>  <int>  <int>    <chr>         <chr>     <int> <dbl>
1     1         60       RL          65    8450   Pave    NA      Reg         Lvl    AllPub ⋯     NA    NA          NA       0      2   2008       WD        Normal    208500     1
2     2         20       RL          80    9600   Pave    NA      Reg         Lvl    AllPub ⋯     NA    NA          NA       0      5   2007       WD        Normal    181500     1
3     3         60       RL          68   11250   Pave    NA      IR1         Lvl    AllPub ⋯     NA    NA          NA       0      9   2008       WD        Normal    223500     1
4     4         70       RL          60    9550   Pave    NA      IR1         Lvl    AllPub ⋯     NA    NA          NA       0      2   2006       WD       Abnorml    140000     1
5     5         60       RL          84   14260   Pave    NA      IR1         Lvl    AllPub ⋯     NA    NA          NA       0     12   2008       WD        Normal    250000     1
6     6         50       RL          85   14115   Pave    NA      IR1         Lvl    AllPub ⋯     NA MnPrv        Shed     700     10   2009       WD        Normal    143000     1
'data.frame':	2919 obs. of  82 variables:
 $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
 $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
 $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
 $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
 $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
 $ Alley        : chr  NA NA NA NA ...
 $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
 $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
 $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
 $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
 $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
 $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
 $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
 $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
 $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
 $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
 $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
 $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
 $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
 $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
 $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
 $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
 $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
 $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
 $ MasVnrType   : chr  "BrkFace" "None" "BrkFace" "None" ...
 $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
 $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
 $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
 $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
 $ BsmtQual     : chr  "Gd" "Gd" "Gd" "TA" ...
 $ BsmtCond     : chr  "TA" "TA" "TA" "Gd" ...
 $ BsmtExposure : chr  "No" "Gd" "Mn" "No" ...
 $ BsmtFinType1 : chr  "GLQ" "ALQ" "GLQ" "ALQ" ...
 $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
 $ BsmtFinType2 : chr  "Unf" "Unf" "Unf" "Unf" ...
 $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
 $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
 $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
 $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
 $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
 $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
 $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
 $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
 $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
 $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
 $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
 $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
 $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
 $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
 $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
 $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
 $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
 $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
 $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
 $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
 $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
 $ FireplaceQu  : chr  NA "TA" "TA" "Gd" ...
 $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Detchd" ...
 $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
 $ GarageFinish : chr  "RFn" "RFn" "RFn" "Unf" ...
 $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
 $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
 $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
 $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
 $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
 $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
 $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
 $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
 $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
 $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolQC       : chr  NA NA NA NA ...
 $ Fence        : chr  NA NA NA NA ...
 $ MiscFeature  : chr  NA NA NA NA ...
 $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
 $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
 $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
 $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
 $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
 $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
 $ train        : num  1 1 1 1 1 1 1 1 1 1 ...
       Id           MSSubClass       MSZoning          LotFrontage    
 Min.   :   1.0   Min.   : 20.00   Length:2919        Min.   : 21.00  
 1st Qu.: 730.5   1st Qu.: 20.00   Class :character   1st Qu.: 59.00  
 Median :1460.0   Median : 50.00   Mode  :character   Median : 68.00  
 Mean   :1460.0   Mean   : 57.14                      Mean   : 69.31  
 3rd Qu.:2189.5   3rd Qu.: 70.00                      3rd Qu.: 80.00  
 Max.   :2919.0   Max.   :190.00                      Max.   :313.00  
                                                      NA's   :486     
    LotArea          Street             Alley             LotShape        
 Min.   :  1300   Length:2919        Length:2919        Length:2919       
 1st Qu.:  7478   Class :character   Class :character   Class :character  
 Median :  9453   Mode  :character   Mode  :character   Mode  :character  
 Mean   : 10168                                                           
 3rd Qu.: 11570                                                           
 Max.   :215245                                                           
                                                                          
 LandContour         Utilities          LotConfig          LandSlope        
 Length:2919        Length:2919        Length:2919        Length:2919       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 Neighborhood        Condition1         Condition2          BldgType        
 Length:2919        Length:2919        Length:2919        Length:2919       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
  HouseStyle         OverallQual      OverallCond      YearBuilt   
 Length:2919        Min.   : 1.000   Min.   :1.000   Min.   :1872  
 Class :character   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
 Mode  :character   Median : 6.000   Median :5.000   Median :1973  
                    Mean   : 6.089   Mean   :5.565   Mean   :1971  
                    3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2001  
                    Max.   :10.000   Max.   :9.000   Max.   :2010  
                                                                   
  YearRemodAdd   RoofStyle           RoofMatl         Exterior1st       
 Min.   :1950   Length:2919        Length:2919        Length:2919       
 1st Qu.:1965   Class :character   Class :character   Class :character  
 Median :1993   Mode  :character   Mode  :character   Mode  :character  
 Mean   :1984                                                           
 3rd Qu.:2004                                                           
 Max.   :2010                                                           
                                                                        
 Exterior2nd         MasVnrType          MasVnrArea      ExterQual        
 Length:2919        Length:2919        Min.   :   0.0   Length:2919       
 Class :character   Class :character   1st Qu.:   0.0   Class :character  
 Mode  :character   Mode  :character   Median :   0.0   Mode  :character  
                                       Mean   : 102.2                     
                                       3rd Qu.: 164.0                     
                                       Max.   :1600.0                     
                                       NA's   :23                         
  ExterCond          Foundation          BsmtQual           BsmtCond        
 Length:2919        Length:2919        Length:2919        Length:2919       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
 Length:2919        Length:2919        Min.   :   0.0   Length:2919       
 Class :character   Class :character   1st Qu.:   0.0   Class :character  
 Mode  :character   Mode  :character   Median : 368.5   Mode  :character  
                                       Mean   : 441.4                     
                                       3rd Qu.: 733.0                     
                                       Max.   :5644.0                     
                                       NA's   :1                          
   BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       Heating         
 Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Length:2919       
 1st Qu.:   0.00   1st Qu.: 220.0   1st Qu.: 793.0   Class :character  
 Median :   0.00   Median : 467.0   Median : 989.5   Mode  :character  
 Mean   :  49.58   Mean   : 560.8   Mean   :1051.8                     
 3rd Qu.:   0.00   3rd Qu.: 805.5   3rd Qu.:1302.0                     
 Max.   :1526.00   Max.   :2336.0   Max.   :6110.0                     
 NA's   :1         NA's   :1        NA's   :1                          
  HeatingQC          CentralAir         Electrical          X1stFlrSF   
 Length:2919        Length:2919        Length:2919        Min.   : 334  
 Class :character   Class :character   Class :character   1st Qu.: 876  
 Mode  :character   Mode  :character   Mode  :character   Median :1082  
                                                          Mean   :1160  
                                                          3rd Qu.:1388  
                                                          Max.   :5095  
                                                                        
   X2ndFlrSF       LowQualFinSF        GrLivArea     BsmtFullBath   
 Min.   :   0.0   Min.   :   0.000   Min.   : 334   Min.   :0.0000  
 1st Qu.:   0.0   1st Qu.:   0.000   1st Qu.:1126   1st Qu.:0.0000  
 Median :   0.0   Median :   0.000   Median :1444   Median :0.0000  
 Mean   : 336.5   Mean   :   4.694   Mean   :1501   Mean   :0.4299  
 3rd Qu.: 704.0   3rd Qu.:   0.000   3rd Qu.:1744   3rd Qu.:1.0000  
 Max.   :2065.0   Max.   :1064.000   Max.   :5642   Max.   :3.0000  
                                                    NA's   :2       
  BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr 
 Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.00  
 1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.00  
 Median :0.00000   Median :2.000   Median :0.0000   Median :3.00  
 Mean   :0.06136   Mean   :1.568   Mean   :0.3803   Mean   :2.86  
 3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.00  
 Max.   :2.00000   Max.   :4.000   Max.   :2.0000   Max.   :8.00  
 NA's   :2                                                        
  KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
 Min.   :0.000   Length:2919        Min.   : 2.000   Length:2919       
 1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
 Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
 Mean   :1.045                      Mean   : 6.452                     
 3rd Qu.:1.000                      3rd Qu.: 7.000                     
 Max.   :3.000                      Max.   :15.000                     
                                                                       
   Fireplaces     FireplaceQu         GarageType         GarageYrBlt  
 Min.   :0.0000   Length:2919        Length:2919        Min.   :1895  
 1st Qu.:0.0000   Class :character   Class :character   1st Qu.:1960  
 Median :1.0000   Mode  :character   Mode  :character   Median :1979  
 Mean   :0.5971                                         Mean   :1978  
 3rd Qu.:1.0000                                         3rd Qu.:2002  
 Max.   :4.0000                                         Max.   :2207  
                                                        NA's   :159   
 GarageFinish         GarageCars      GarageArea      GarageQual       
 Length:2919        Min.   :0.000   Min.   :   0.0   Length:2919       
 Class :character   1st Qu.:1.000   1st Qu.: 320.0   Class :character  
 Mode  :character   Median :2.000   Median : 480.0   Mode  :character  
                    Mean   :1.767   Mean   : 472.9                     
                    3rd Qu.:2.000   3rd Qu.: 576.0                     
                    Max.   :5.000   Max.   :1488.0                     
                    NA's   :1       NA's   :1                          
  GarageCond         PavedDrive          WoodDeckSF       OpenPorchSF    
 Length:2919        Length:2919        Min.   :   0.00   Min.   :  0.00  
 Class :character   Class :character   1st Qu.:   0.00   1st Qu.:  0.00  
 Mode  :character   Mode  :character   Median :   0.00   Median : 26.00  
                                       Mean   :  93.71   Mean   : 47.49  
                                       3rd Qu.: 168.00   3rd Qu.: 70.00  
                                       Max.   :1424.00   Max.   :742.00  
                                                                         
 EnclosedPorch      X3SsnPorch       ScreenPorch        PoolArea      
 Min.   :   0.0   Min.   :  0.000   Min.   :  0.00   Min.   :  0.000  
 1st Qu.:   0.0   1st Qu.:  0.000   1st Qu.:  0.00   1st Qu.:  0.000  
 Median :   0.0   Median :  0.000   Median :  0.00   Median :  0.000  
 Mean   :  23.1   Mean   :  2.602   Mean   : 16.06   Mean   :  2.252  
 3rd Qu.:   0.0   3rd Qu.:  0.000   3rd Qu.:  0.00   3rd Qu.:  0.000  
 Max.   :1012.0   Max.   :508.000   Max.   :576.00   Max.   :800.000  
                                                                      
    PoolQC             Fence           MiscFeature           MiscVal        
 Length:2919        Length:2919        Length:2919        Min.   :    0.00  
 Class :character   Class :character   Class :character   1st Qu.:    0.00  
 Mode  :character   Mode  :character   Mode  :character   Median :    0.00  
                                                          Mean   :   50.83  
                                                          3rd Qu.:    0.00  
                                                          Max.   :17000.00  
                                                                            
     MoSold           YrSold       SaleType         SaleCondition     
 Min.   : 1.000   Min.   :2006   Length:2919        Length:2919       
 1st Qu.: 4.000   1st Qu.:2007   Class :character   Class :character  
 Median : 6.000   Median :2008   Mode  :character   Mode  :character  
 Mean   : 6.213   Mean   :2008                                        
 3rd Qu.: 8.000   3rd Qu.:2009                                        
 Max.   :12.000   Max.   :2010                                        
                                                                      
   SalePrice          train       
 Min.   : 34900   Min.   :0.0000  
 1st Qu.:129975   1st Qu.:0.0000  
 Median :163000   Median :1.0000  
 Mean   :180921   Mean   :0.5002  
 3rd Qu.:214000   3rd Qu.:1.0000  
 Max.   :755000   Max.   :1.0000  
 NA's   :1459                     

For data cleaning purposes, we merged test and train into one dataset; the 1459 NA's in the SalePrice column come from the test set. The added train column indicates whether a row is from the train set (1) or the test set (0).

Data Cleaning

In order to use a linear regression model, we need to convert all of the categorical variables into dummy variables. We will also remove or impute the NA's in the data set.
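As a minimal sketch of what dummy encoding does, base R's model.matrix() can expand a categorical column into 0/1 indicator columns. The toy data frame below is hypothetical and is not the Ames data; the notebook itself performs this step later with caret::dummyVars().

```r
# Hypothetical toy data (illustration only, not the Ames data)
toy <- data.frame(
  LotArea = c(8450, 9600, 11250),
  Street  = c("Pave", "Grvl", "Pave"),
  stringsAsFactors = TRUE
)

# "- 1" drops the intercept, so the factor is expanded into one
# indicator column per level instead of being coded against a baseline
toy_dummy <- as.data.frame(model.matrix(~ . - 1, data = toy))

print(toy_dummy)
```

Each level of Street becomes its own numeric column, which is the form a linear regression routine can use directly.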

In [35]:
# Summarize NA's by  column
ames %>%
  summarise_all(~(sum(is.na(.)))) %>%
  gather(key = "Column", value = "NA_Count", -1) %>%
  filter(NA_Count > 0) %>%
  ggplot(aes(x = reorder(Column, NA_Count), y = NA_Count)) +
  geom_col() +
  coord_flip() +
  theme_gdocs() +
  labs(title = "Number of NA's by Column", x = "Column", y = "NA Count")

# Create a table of the missing NAs by column
ames %>%
  summarise_all(~(sum(is.na(.)))) %>%
  gather(key = "Column", value = "NA_Count", -1) %>%
  filter(NA_Count > 0) %>%
  arrange(desc(NA_Count)) %>%
  select(-Id) # %>% 
  # kable()



library(naniar)
vis_miss(ames[c(2:40)],cluster = TRUE, sort_miss =TRUE)
vis_miss(ames[c(41:81)],cluster = TRUE, sort_miss = TRUE)
A data.frame: 35 × 2
Column NA_Count
<chr> <int>
PoolQC 2909
MiscFeature 2814
Alley 2721
Fence 2348
SalePrice 1459
FireplaceQu 1420
LotFrontage 486
GarageYrBlt 159
GarageFinish 159
GarageQual 159
GarageCond 159
GarageType 157
BsmtCond 82
BsmtExposure 82
BsmtQual 81
BsmtFinType2 80
BsmtFinType1 79
MasVnrType 24
MasVnrArea 23
MSZoning 4
Utilities 2
BsmtFullBath 2
BsmtHalfBath 2
Functional 2
Exterior1st 1
Exterior2nd 1
BsmtFinSF1 1
BsmtFinSF2 1
BsmtUnfSF 1
TotalBsmtSF 1
Electrical 1
KitchenQual 1
GarageCars 1
GarageArea 1
SaleType 1
[Figure: bar chart "Number of NA's by Column"]
[Figure: vis_miss() missingness map, columns 2-40]
[Figure: vis_miss() missingness map, columns 41-81]

Several columns contain many NA's, but most of them reflect the absence of a feature rather than an unrecorded value. For example, if a house does not have a pool, then the PoolQC column will be NA.
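The idea of treating NA as "feature absent" can be sketched with a small helper that fills character columns with "None" and numeric columns with 0. The helper name fill_absent and the toy data are hypothetical; the notebook applies the same rule column by column with mutate() and ifelse().

```r
# Hypothetical toy data: NA means the house has no pool
toy <- data.frame(
  PoolQC   = c(NA, "Gd", NA),
  PoolArea = c(NA, 512, 0),
  stringsAsFactors = FALSE
)

# Replace NA with "None" in character columns and with 0 in numeric columns
fill_absent <- function(x) {
  if (is.character(x)) {
    x[is.na(x)] <- "None"
  } else if (is.numeric(x)) {
    x[is.na(x)] <- 0
  }
  x
}

# Apply the helper to every column while keeping the data frame structure
toy[] <- lapply(toy, fill_absent)

print(toy)
```

This rule only makes sense for columns where NA really does mean the feature is absent; columns with genuinely unrecorded values need a different treatment.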

In [36]:
# Imputation

# If pool-related variables are NA, assume there is no pool: "None" for PoolQC, 0 for PoolArea
ames <- ames %>%
  mutate(
    PoolQC = ifelse(is.na(PoolQC), "None", PoolQC),
    PoolArea = ifelse(is.na(PoolArea), 0, PoolArea),
  )
# If garage-related variables are NA, assume there is no garage: "None" for categorical variables, 0 for numeric ones
ames <- ames %>%
  mutate(
    GarageType = ifelse(is.na(GarageType), "None", GarageType),
    GarageYrBlt = ifelse(is.na(GarageYrBlt), 1979, GarageYrBlt), # 1979 is the median GarageYrBlt; 0 would be far outside the range of real year values
    GarageFinish = ifelse(is.na(GarageFinish), "None", GarageFinish),
    GarageCars = ifelse(is.na(GarageCars), 0, GarageCars),
    GarageArea = ifelse(is.na(GarageArea), 0, GarageArea),
    GarageQual = ifelse(is.na(GarageQual), "None", GarageQual),
    GarageCond = ifelse(is.na(GarageCond), "None", GarageCond)
  )
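# If basement-related variables are NA, assume there is no basement: "None" for categorical variables, 0 for numeric ones.
# Also set LotFrontage and MasVnrArea to 0, MasVnrType to "None", Utilities to "AllPub" (the most common value),
# Exterior1st/Exterior2nd to "Other", and Electrical to "FuseA"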
ames <- ames %>%
  mutate(
    BsmtQual = ifelse(is.na(BsmtQual), "None", BsmtQual),
    BsmtCond = ifelse(is.na(BsmtCond), "None", BsmtCond),
    BsmtExposure = ifelse(is.na(BsmtExposure), "None", BsmtExposure),
    BsmtFinType1 = ifelse(is.na(BsmtFinType1), "None", BsmtFinType1),
    BsmtFinSF1 = ifelse(is.na(BsmtFinSF1), 0, BsmtFinSF1),
    BsmtFinType2 = ifelse(is.na(BsmtFinType2), "None", BsmtFinType2),
    BsmtFinSF2 = ifelse(is.na(BsmtFinSF2), 0, BsmtFinSF2),
    BsmtUnfSF = ifelse(is.na(BsmtUnfSF), 0, BsmtUnfSF),
    BsmtFullBath = ifelse(is.na(BsmtFullBath), 0, BsmtFullBath),
    BsmtHalfBath = ifelse(is.na(BsmtHalfBath), 0, BsmtHalfBath),
    TotalBsmtSF = ifelse(is.na(TotalBsmtSF), 0, TotalBsmtSF),
    LotFrontage = ifelse(is.na(LotFrontage), 0, LotFrontage),
    MasVnrArea = ifelse(is.na(MasVnrArea), 0, MasVnrArea),
    MasVnrType = ifelse(is.na(MasVnrType), "None", MasVnrType),
    Utilities = ifelse(is.na(Utilities), "AllPub", Utilities),
    Exterior1st = ifelse(is.na(Exterior1st), "Other", Exterior1st),
    Exterior2nd = ifelse(is.na(Exterior2nd), "Other", Exterior2nd),
    Electrical = ifelse(is.na(Electrical), "FuseA", Electrical),
  )
# If Fence, MiscFeature, FireplaceQu, or Alley is NA, assume the feature is absent and assign "None"
ames <- ames %>%
  mutate(
    Fence = ifelse(is.na(Fence), "None", Fence),
    MiscFeature = ifelse(is.na(MiscFeature), "None", MiscFeature),
    FireplaceQu = ifelse(is.na(FireplaceQu), "None", FireplaceQu),
    Alley = ifelse(is.na(Alley), "None", Alley)
  )

# Summarize the amount of remaining NA's by column to check what's left
colSums(is.na(ames))

# create a dataset for eda named ameseda
ameseda <- ames[ames$train == 1, ]


# Use the dummyVars() function to convert categorical variables into dummy variables
# Then use janitor::clean_names() to clean up the column names
dummy_model <- dummyVars(~ ., data = ames)
ames_dummy <- as.data.frame(predict(dummy_model, newdata = ames))
ames_dummy <- clean_names(ames_dummy)

# NOTE: Probably could make the case for deleting NAs here -Nick
# Fill in all remaining na values with the mean of the column
ames_dummy <- ames_dummy %>%
  mutate(across(
    -sale_price,
    ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)
  ))

# create ames dataset for modeling to be consistent with team member's terminology
ames <-ames_dummy

# Summary of missing values after imputation and dummy encoding.
# sale_price (NA for the rows from the 'test' dataset) is the only column with missing values.

gg_miss_var(ames_dummy[, 1:50])
gg_miss_var(ames_dummy[, 51:100])
gg_miss_var(ames_dummy[, 101:150])
gg_miss_var(ames_dummy[, 151:200])
gg_miss_var(ames_dummy[, 201:250])
gg_miss_var(ames_dummy[, 251:305]) # start at 251 so column 250 is not plotted twice




vis_miss(ameseda[c(2:40)],cluster = TRUE, sort_miss =TRUE)
vis_miss(ameseda[c(41:81)],cluster = TRUE, sort_miss = TRUE)
            Id     MSSubClass       MSZoning    LotFrontage        LotArea         Street
             0              0              4              0              0              0
         Alley       LotShape    LandContour      Utilities      LotConfig      LandSlope
             0              0              0              0              0              0
  Neighborhood     Condition1     Condition2       BldgType     HouseStyle    OverallQual
             0              0              0              0              0              0
   OverallCond      YearBuilt   YearRemodAdd      RoofStyle       RoofMatl    Exterior1st
             0              0              0              0              0              0
   Exterior2nd     MasVnrType     MasVnrArea      ExterQual      ExterCond     Foundation
             0              0              0              0              0              0
      BsmtQual       BsmtCond   BsmtExposure   BsmtFinType1     BsmtFinSF1   BsmtFinType2
             0              0              0              0              0              0
    BsmtFinSF2      BsmtUnfSF    TotalBsmtSF        Heating      HeatingQC     CentralAir
             0              0              0              0              0              0
    Electrical      X1stFlrSF      X2ndFlrSF   LowQualFinSF      GrLivArea   BsmtFullBath
             0              0              0              0              0              0
  BsmtHalfBath       FullBath       HalfBath   BedroomAbvGr   KitchenAbvGr    KitchenQual
             0              0              0              0              0              1
  TotRmsAbvGrd     Functional     Fireplaces    FireplaceQu     GarageType    GarageYrBlt
             0              2              0              0              0              0
  GarageFinish     GarageCars     GarageArea     GarageQual     GarageCond     PavedDrive
             0              0              0              0              0              0
    WoodDeckSF    OpenPorchSF  EnclosedPorch     X3SsnPorch    ScreenPorch       PoolArea
             0              0              0              0              0              0
        PoolQC          Fence    MiscFeature        MiscVal         MoSold         YrSold
             0              0              0              0              0              0
      SaleType  SaleCondition      SalePrice          train
             1              0           1459              0
[Figures: six gg_miss_var() plots of missing values per column of ames_dummy after imputation, followed by two vis_miss() missingness maps of ameseda (columns 2-40 and 41-81)]

Imputation:

Pool-related variables: Upon investigation, we found that the missing data for pool-related variables followed a MNAR (Missing Not at Random) pattern, occurring specifically in homes without pools. To address this, we replaced the missing values with "None" or 0, depending on the variable type.

Garage-related variables: The missing data for garage-related variables also followed a MNAR pattern, occurring in homes without garages. We imputed the missing values with "None" or 0, depending on the variable type. The exception is GarageYrBlt, which we set to 1979 (the median year) because 0 is not a meaningful year.

Basement-related variables: As with the pool and garage variables, the missing data for basement-related variables displayed a MNAR pattern, primarily in homes without basements. We replaced the missing values with "None" or 0, based on the variable type.

Additionally, the categorical variables Fence, FireplaceQu, Alley, and MiscFeature, which exhibited a MNAR pattern, were assigned the value "None".

For the remaining variables, which followed a MCAR (Missing Completely at Random) pattern and had relatively few missing values, we imputed the missing values with the column mean after dummy encoding.
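Mean imputation can be sketched on a small example as follows. The data frame, its column names, and the helper impute_mean are hypothetical. The third column illustrates a side effect worth keeping in mind: when a 0/1 dummy column contains NA, the imputed value is the share of ones, which is a fraction rather than 0 or 1.

```r
# Hypothetical toy data with a few missing values
toy <- data.frame(
  lot_area        = c(1, NA, 3),
  garage_area     = c(10, 20, NA),
  kitchen_qual_gd = c(1, 0, NA)  # a 0/1 dummy column with one NA
)

# Replace NA with the mean of the observed values in the same column
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

toy[] <- lapply(toy, impute_mean)

print(toy)
```

For a handful of missing values this is a pragmatic choice; with many missing values, a mode or model-based imputation might be preferable for categorical columns.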

After imputation and processing, the combined dataset was split back into 'train' and 'test'.
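A minimal sketch of such a split, using the train indicator column on a hypothetical toy data frame (the object names combined, train_set, and test_set are illustrative assumptions):

```r
# Hypothetical combined data: train == 1 marks training rows, 0 marks test rows
combined <- data.frame(
  id         = 1:4,
  sale_price = c(200000, 150000, NA, NA),
  train      = c(1, 1, 0, 0)
)

# Split on the indicator column
train_set <- combined[combined$train == 1, ]
test_set  <- combined[combined$train == 0, ]

# The response is unknown for test rows, so drop the placeholder column
test_set$sale_price <- NULL

print(train_set)
print(test_set)
```

Because the indicator was added before the two files were combined, the split recovers exactly the original rows even after rows have been removed or reordered.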

Influential points in training data:

In [37]:
ames[524, ] %>% select(sale_price, gr_liv_area)
ames[1299, ] %>% select(sale_price, gr_liv_area)

# Remove the two outliers
ames <- ames[-c(524, 1299), ]
A data.frame: 1 × 2
     sale_price gr_liv_area
          <dbl>       <dbl>
524      184750        4676
A data.frame: 1 × 2
     sale_price gr_liv_area
          <dbl>       <dbl>
1299     160000        5642

Observations 524 and 1299 were identified as outliers based on their GrLivArea and SalePrice values: both have a very large living area (4676 and 5642 square feet) but a comparatively low sale price ($184,750 and $160,000). These observations were removed from the training dataset. We assume that some factor not captured in the data explains these values.
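Removing rows by hard-coded index depends on the row order staying fixed. As a hedged alternative sketch, the same kind of observation can be flagged by a condition on the two variables. The thresholds (4000 square feet, $200,000) are illustrative assumptions, not values used in the notebook. The first two toy rows copy the values printed above, the fourth copies the first house in the data, and the remaining rows are made up.

```r
# Toy data: rows 1-2 copy the two removed observations, the rest are illustrative
toy <- data.frame(
  sale_price  = c(184750, 160000, 745000, 208500, NA),
  gr_liv_area = c(4676, 5642, 4476, 1710, 5000)
)

# Flag very large houses that sold for a low price (assumed thresholds)
is_outlier <- toy$gr_liv_area > 4000 & toy$sale_price < 200000

# Rows with an unknown price (e.g. test rows) yield NA and are kept
keep <- !(is_outlier %in% TRUE)
toy_clean <- toy[keep, ]

print(toy_clean)
```

A rule like this documents why a row was dropped and keeps working if the data are re-sorted, at the cost of having to justify the thresholds.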

Exploratory Data Analysis

Next, we explore the Ames housing market data. By examining the dataset closely, we aim to uncover key patterns, trends, and relationships that will help us build robust models.

Numerical Data Analysis I

In [38]:
#ameseda_n is used for eda analysis on all numeric variables

ameseda_n <- ameseda %>%
  select(where(is.numeric)) # is.numeric() is TRUE for integer columns as well

#library(gridExtra)

# Prepare long-format values for ggplot
ames_long <- ameseda_n %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value")

# Set the plot size and aspect ratio
options(repr.plot.width = 10, repr.plot.height = 6)

# Divide the variables into 4 groups

# Group 1
group1 <- c( "MSSubClass", "LotFrontage", "LotArea", "OverallQual", "OverallCond", "YearBuilt", "YearRemodAdd")

# Group 2
group2 <- c("MasVnrArea", "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "X1stFlrSF", "X2ndFlrSF", "LowQualFinSF")

# Group 3
group3 <- c("GrLivArea", "BsmtFullBath", "BsmtHalfBath", "FullBath", "HalfBath", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd")

# Group 4
group4 <- c("Fireplaces", "GarageYrBlt", "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
            "X3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal", "MoSold", "YrSold", "SalePrice")



# Helper: faceted boxplots for one group of variables
plot_group <- function(vars, group_label) {
  ames_long %>%
    filter(variable %in% vars) %>%
    ggplot(aes(x = variable, y = value)) +
    geom_boxplot() +
    facet_wrap(~variable, scales = "free") +
    theme(axis.text.x = element_blank()) +
    labs(title = paste("Boxplots -", group_label), x = "Variables", y = "Values")
}

# Create plots for each group of variables
plot1 <- plot_group(group1, "Group 1")
plot2 <- plot_group(group2, "Group 2")
plot3 <- plot_group(group3, "Group 3")
plot4 <- plot_group(group4, "Group 4")

# Summary table on all numeric variables from dataset 
library(psych)
describe(ameseda_n)

# Display plots
plot1
plot2
plot3
plot4
A psych: 39 × 13
variable  vars  n  mean  sd  median  trimmed  mad  min  max  range  skew  kurtosis  se
          <int>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
Id             1  1460  7.305000e+02  4.216100e+02  730.5  7.305000e+02  541.1490  1  1460  1459  0.00000000  -1.20246603  1.103404e+01
MSSubClass     2  1460  5.689726e+01  4.230057e+01  50.0  4.915240e+01  44.4780  20  190  170  1.40476562  1.56441572  1.107057e+00
LotFrontage    3  1460  5.762329e+01  3.466430e+01  63.0  5.793921e+01  25.2042  0  313  313  0.26727232  3.58518810  9.072063e-01
LotArea        4  1460  1.051683e+04  9.981265e+03  9478.5  9.563284e+03  2962.2348  1300  215245  213945  12.18261502  202.26232234  2.612216e+02
OverallQual    5  1460  6.099315e+00  1.382997e+00  6.0  6.079623e+00  1.4826  1  10  9  0.21649836  0.08762258  3.619467e-02
OverallCond    6  1460  5.575342e+00  1.112799e+00  5.0  5.477740e+00  0.0000  1  9  8  0.69164401  1.09290874  2.912329e-02
YearBuilt      7  1460  1.971268e+03  3.020290e+01  1973.0  1.974127e+03  37.0650  1872  2010  138  -0.61220121  -0.44565754  7.904461e-01
YearRemodAdd   8  1460  1.984866e+03  2.064541e+01  1994.0  1.986369e+03  19.2738  1950  2010  60  -0.50252776  -1.27436545  5.403150e-01
MasVnrArea     9  1460  1.031171e+02  1.807314e+02  0.0  6.254110e+01  0.0000  0  1600  1600  2.67211701  10.08466917  4.729956e+00
BsmtFinSF1     10  1460  4.436397e+02  4.560981e+02  383.5  3.860762e+02  568.5771  0  5644  5644  1.68204129  11.05681415  1.193663e+01
BsmtFinSF2     11  1460  4.654932e+01  1.613193e+02  0.0  1.382705e+00  0.0000  0  1474  1474  4.24652141  20.00886409  4.221918e+00
BsmtUnfSF      12  1460  5.672404e+02  4.418670e+02  477.5  5.192885e+02  426.9888  0  2336  2336  0.91837835  0.46451129  1.156419e+01
TotalBsmtSF    13  1460  1.057429e+03  4.387053e+02  991.5  1.036695e+03  347.6697  0  6110  6110  1.52112395  13.17885602  1.148144e+01
X1stFlrSF      14  1460  1.162627e+03  3.865877e+02  1087.0  1.129991e+03  347.6697  334  4692  4358  1.37392896  5.71013207  1.011746e+01
X2ndFlrSF      15  1460  3.469925e+02  4.365284e+02  0.0  2.853639e+02  0.0000  0  2065  2065  0.81135997  -0.55902397  1.142447e+01
LowQualFinSF   16  1460  5.844521e+00  4.862308e+01  0.0  0.000000e+00  0.0000  0  572  572  8.99283329  82.82823852  1.272524e+00
GrLivArea      17  1460  1.515464e+03  5.254804e+02  1464.0  1.467670e+03  483.3276  334  5642  5308  1.36375364  4.86348279  1.375245e+01
BsmtFullBath   18  1460  4.253425e-01  5.189106e-01  0.0  3.921233e-01  0.0000  0  3  3  0.59484237  -0.84329160  1.358051e-02
BsmtHalfBath   19  1460  5.753425e-02  2.387526e-01  0.0  0.000000e+00  0.0000  0  2  2  4.09497490  16.30995691  6.248442e-03
FullBath       20  1460  1.565068e+00  5.509158e-01  2.0  1.560788e+00  0.0000  0  3  3  0.03648647  -0.86115028  1.441813e-02
HalfBath       21  1460  3.828767e-01  5.028854e-01  0.0  3.433219e-01  0.0000  0  2  2  0.67450925  -1.07998235  1.316111e-02
BedroomAbvGr   22  1460  2.866438e+00  8.157780e-01  3.0  2.852740e+00  0.0000  0  8  8  0.21135511  2.21198810  2.134989e-02
KitchenAbvGr   23  1460  1.046575e+00  2.203382e-01  1.0  1.000000e+00  0.0000  0  3  3  4.47917826  21.42113861  5.766514e-03
TotRmsAbvGrd   24  1460  6.517808e+00  1.625393e+00  6.0  6.408390e+00  1.4826  2  14  12  0.67495173  0.86833683  4.253849e-02
Fireplaces     25  1460  6.130137e-01  6.446664e-01  1.0  5.342466e-01  1.4826  0  3  3  0.64823107  -0.22440683  1.687169e-02
GarageYrBlt    26  1460  1.978534e+03  2.399485e+01  1979.0  1.980998e+03  29.6520  1900  2010  110  -0.67020290  -0.27050466  6.279739e-01
GarageCars     27  1460  1.767123e+00  7.473150e-01  2.0  1.773973e+00  0.0000  0  4  4  -0.34184538  0.21173072  1.955813e-02
GarageArea     28  1460  4.729801e+02  2.138048e+02  480.0  4.698082e+02  177.9120  0  1418  1418  0.17961125  0.90446871  5.595528e+00
WoodDeckSF     29  1460  9.424452e+01  1.253388e+02  0.0  7.175771e+01  0.0000  0  857  857  1.53820999  2.97041708  3.280266e+00
OpenPorchSF    30  1460  4.666027e+01  6.625603e+01  25.0  3.323288e+01  37.0650  0  547  547  2.35948572  8.44149101  1.733999e+00
EnclosedPorch  31  1460  2.195411e+01  6.111915e+01  0.0  3.866438e+00  0.0000  0  552  552  3.08352575  10.37263409  1.599561e+00
X3SsnPorch     32  1460  3.409589e+00  2.931733e+01  0.0  0.000000e+00  0.0000  0  508  508  10.28317840  123.06231159  7.672696e-01
ScreenPorch    33  1460  1.506096e+01  5.575742e+01  0.0  0.000000e+00  0.0000  0  480  480  4.11374731  18.34260759  1.459238e+00
PoolArea       34  1460  2.758904e+00  4.017731e+01  0.0  0.000000e+00  0.0000  0  738  738  14.79791829  222.19170782  1.051488e+00
MiscVal        35  1460  4.348904e+01  4.961230e+02  0.0  0.000000e+00  0.0000  0  15500  15500  24.42652237  697.64007214  1.298413e+01
MoSold         36  1460  6.321918e+00  2.703626e+00  6.0  6.252568e+00  2.9652  1  12  11  0.21161746  -0.41038457  7.075713e-02
YrSold         37  1460  2.007816e+03  1.328095e+00  2008.0  2.007770e+03  1.4826  2006  2010  4  0.09607079  -1.19311159  3.475784e-02
SalePrice      38  1460  1.809212e+05  7.944250e+04  163000.0  1.707833e+05  56338.8000  34900  755000  720100  1.87900860  6.49678933  2.079105e+03
train          39  1460  1.000000e+00  0.000000e+00  1.0  1.000000e+00  0.0000  1  1  0  NaN  NaN  0.000000e+00
[Figures: Boxplots - Group 1, Group 2, Group 3, Group 4]

The boxplots and summary table show that most of the numerical variables are right-skewed (positive skew). A few variables, namely YearBuilt, YearRemodAdd, GarageYrBlt, and GarageCars, are left-skewed (negative skew). The response variable, SalePrice, is also right-skewed (skew of 1.88) and contains high-priced outliers.
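The skewness figures quoted above can be reproduced in spirit with a few lines of base R. The sketch below is illustrative only: it uses simulated log-normal "prices" rather than the Ames data, and the helper `skewness()` is a hypothetical moment-based estimator, so its values may differ slightly from those reported by `describe()`. It also shows that a log transform, one common remedy for right skew, pulls the skewness toward zero.

```r
# Illustrative only: simulated right-skewed "prices", not the Ames data
set.seed(42)
price <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

# Moment-based sample skewness:
# third central moment divided by the variance raised to the power 3/2
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(price)       # clearly positive for right-skewed data
skew_log <- skewness(log(price))  # close to zero after the log transform

round(c(raw = skew_raw, log = skew_log), 3)
```

A positive value indicates a long right tail and a negative value a long left tail, which is how the `skew` column of the summary table was read above.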

Categorical Data Analysis I

In [39]:
# Create a dataset of all categorical (character) variables and convert them to factors
ameseda_c <- ameseda %>%
  select(where(is.character)) %>%
  mutate(across(everything(), as.factor))

# Attach the response so it can be plotted against each categorical variable
ameseda_c <- ameseda_c %>%
  mutate(SalePrice = ameseda_n$SalePrice)

dataset <- ameseda_c
response_variable <- "SalePrice"

# Categorical variables to plot
categorical_variables <- c("MSZoning", "Street", "Alley", "LotShape", "LandContour", "Utilities", "LotConfig", "LandSlope",
                           "Neighborhood", "Condition1", "Condition2", "BldgType", "HouseStyle", "RoofStyle", "RoofMatl",
                           "Exterior1st", "Exterior2nd", "MasVnrType", "ExterQual", "ExterCond", "Foundation", "BsmtQual",
                           "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2", "Heating", "HeatingQC", "CentralAir",
                           "Electrical", "KitchenQual", "Functional", "FireplaceQu", "GarageType", "GarageFinish",
                           "GarageQual", "GarageCond", "PavedDrive", "PoolQC", "Fence", "MiscFeature", "SaleType",
                           "SaleCondition")

# Create a list to store the plots
plots <- list()

# Loop through the categorical variables and create a histogram of SalePrice for each,
# with the bars filled by the levels of that variable.
# The .data pronoun replaces the deprecated aes_string(). It looks up `variable`
# when the plot is drawn, so the display loop below reuses the same loop variable.
for (variable in categorical_variables) {
  plot <- ggplot(dataset, aes(x = .data[[response_variable]], fill = .data[[variable]])) +
    geom_histogram(color = "black", bins = 30) +
    labs(title = paste("Histogram of", response_variable, "-", variable),
         x = response_variable, fill = variable) +
    theme_bw()

  plots[[variable]] <- plot
}

# Display the histograms
for (variable in categorical_variables) {
  print(plots[[variable]])
}

# Loop through the categorical variables and plot SalePrice against the levels of each
for (variable in categorical_variables) {
  plot <- ggplot(dataset, aes(x = .data[[response_variable]], y = .data[[variable]],
                              color = .data[[variable]])) +
    geom_point() +
    labs(title = paste("Scatter Plot of", response_variable, "vs", variable),
         x = response_variable, y = variable, color = variable) +
    theme_bw()

  print(plot)
}
[Figures: 43 histograms of SalePrice filled by each categorical variable, followed by 43 scatter plots of SalePrice against each categorical variable]

Based on the histograms and scatter plots of SalePrice broken down by each categorical variable, the following variables show visibly different SalePrice distributions across their levels and are therefore candidates for modeling SalePrice: MSZoning, RoofStyle, Exterior1st, Exterior2nd, LotShape, LandContour, LotConfig, Neighborhood, BldgType, HouseStyle, HeatingQC, CentralAir, KitchenQual, FireplaceQu, GarageType, GarageFinish, PavedDrive, SaleType, SaleCondition, Condition1, MasVnrType, ExterQual, ExterCond, Foundation, BsmtQual, BsmtCond, BsmtFinType1, Electrical, Functional, GarageQual, GarageCond.

Conversely, the following variables are less likely to be useful in modeling SalePrice: Street, Alley, Utilities, LandSlope, Condition2, RoofMatl, PoolQC, Fence, MiscFeature, BsmtExposure, BsmtFinType2. These variables do not exhibit a clear relationship with SalePrice in the plots.
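The visual screening above could be backed with simple numbers. The following is a minimal sketch on simulated data, not the authors' procedure: the data frame `toy`, its columns, and the helper `screen_factor()` are all hypothetical. For each factor it reports the share of observations in the most common level (a factor dominated by one level carries little information) and eta-squared, the proportion of response variance explained by the factor in a one-way linear model.

```r
# Illustrative only: a toy data set with one informative and one uninformative factor
set.seed(1)
n <- 600
toy <- data.frame(
  quality = factor(sample(c("Fa", "TA", "Gd", "Ex"), n, replace = TRUE)),
  # 594 of 600 rows share a single level, so this factor is nearly constant
  street  = factor(rep(c("Pave", "Grvl"), times = c(594, 6)))
)
level_effect <- c(Fa = -40000, TA = 0, Gd = 40000, Ex = 90000)
toy$price <- 180000 + unname(level_effect[as.character(toy$quality)]) +
  rnorm(n, sd = 30000)

# For one factor: share of the most common level, and eta-squared
# (the R-squared of a one-way linear model of the response on the factor)
screen_factor <- function(factor_name, data, response_name) {
  f <- data[[factor_name]]
  fit <- lm(data[[response_name]] ~ f)
  c(
    top_level_share = max(table(f)) / length(f),
    eta_squared     = summary(fit)$r.squared
  )
}

screening <- t(sapply(c("quality", "street"), screen_factor,
                      data = toy, response_name = "price"))
round(screening, 3)
```

In this toy example `quality` has a high eta-squared, while `street` combines a top-level share of 0.99 with an eta-squared near zero, which is the kind of pattern that would place a variable in the "less likely to be useful" list.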

Numerical Data Analysis II

In [40]:
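# Create a correlation plot for the numerical variables.
# The constant `train` column has zero variance, so cor() warns that
# "the standard deviation is zero" and returns NA for that variable.
library(corrplot)
corrplot(cor(ameseda_n), tl.cex = 0.6)


# ggpairs plots based on the correlation plot. We did not plot every numerical variable; we chose the ones that showed a high correlation with SalePrice in the correlation plot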
library(GGally)
library(dplyr)



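# Custom panel for the lower triangle: scatter points plus a smoother.
# The smoother defaults to loess and can be overridden through `method`.
lowerFn <- function(data, mapping, method = "loess", ...) {
  p <- ggplot(data = data, mapping = mapping) +
    geom_point(colour = "blue", size = .2) +
    geom_smooth(method = method, color = "red", ...)
  p
}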

# First plot with selected variables
ameseda_n %>%
  select(SalePrice, OverallQual, LotArea, YearBuilt, GrLivArea) %>%
  ggpairs(lower = list(continuous = lowerFn))

# Second plot with selected variables
ameseda_n %>%
  select(SalePrice, YearRemodAdd, TotalBsmtSF, X1stFlrSF, LowQualFinSF) %>%
  ggpairs(lower = list(continuous = lowerFn))

# Third plot with selected variables
ameseda_n %>%
  select(SalePrice, FullBath, TotRmsAbvGrd, Fireplaces, GarageYrBlt, GarageCars, GarageArea) %>%
  ggpairs(lower = list(continuous = lowerFn))

# Variance inflation factors for the selected predictors
library(car)
vif(lm(SalePrice ~ OverallQual + YearBuilt + GrLivArea + YearRemodAdd + X1stFlrSF +
         TotalBsmtSF + FullBath + TotRmsAbvGrd + GarageCars + Fireplaces +
         GarageYrBlt + GarageArea + LotArea,
       data = ameseda_n))
Warning message in cor(ameseda_n):
“the standard deviation is zero”
`geom_smooth()` using formula = 'y ~ x'
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“pseudoinverse used at -0.015”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“neighborhood radius 2.015”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“reciprocal condition number  1.9766e-15”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“There are other near singularities as well. 4.0602”
Warning message in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
“pseudoinverse used at -0.015”
Warning message in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
“neighborhood radius 2.015”
Warning message in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
“reciprocal condition number  1.9766e-15”
Warning message in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
“There are other near singularities as well. 4.0602”
[Figures: correlation plot and ggpairs scatterplot matrices]
OverallQual   2.86660642324521
YearBuilt     3.39581864945437
GrLivArea     5.30810776907409
YearRemodAdd  1.89980266530303
X1stFlrSF     3.78272037109803
TotalBsmtSF   3.63026355194396
FullBath      2.27318886436909
TotRmsAbvGrd  3.37884959744182
GarageCars    5.38364319877975
Fireplaces    1.47442743614239
GarageYrBlt   3.08132167667782
GarageArea    5.18777544286162
LotArea       1.17441230848542

The correlation plot and pairs plots show a relationship between the response variable SalePrice and several predictor variables. In most cases the relationship appears curved (roughly quadratic) rather than linear, which could be driven by the small number of very high SalePrice values. SalePrice is most strongly correlated with OverallQual (0.791), GrLivArea (0.709), GarageCars (0.640), GarageArea (0.623), TotalBsmtSF (0.614), X1stFlrSF (0.606), FullBath (0.561), TotRmsAbvGrd (0.534), YearBuilt (0.523), and YearRemodAdd (0.507); GarageYrBlt shows a milder correlation.

LotArea has a correlation coefficient of only 0.264 with SalePrice. This value should be interpreted with caution: LotArea is heavily right-skewed (skew of 12.18, with a maximum of 215,245 against a median of 9,478.5), and a handful of unusually large lots can inflate or deflate the correlation coefficient. Examining these outliers and their influence would give a more accurate picture of the relationship between lot area and SalePrice.
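The effect of a few extreme lots on the correlation coefficient can be illustrated with simulated data. The sketch below is not based on the Ames data: the variables `lot_area` and `price`, the five artificial outliers, and all numeric settings are assumptions chosen for illustration. It compares the Pearson correlation with and without the outliers, and adds the rank-based Spearman correlation as one possible robust alternative.

```r
# Illustrative only: simulated lots with a clear linear price relationship
set.seed(7)
n <- 500
lot_area <- runif(n, min = 1000, max = 20000)
price    <- 100000 + 8 * lot_area + rnorm(n, sd = 20000)

# Five hypothetical very large lots that sold for ordinary prices
lot_area_out <- c(lot_area, rep(200000, 5))
price_out    <- c(price, rep(150000, 5))

correlations <- c(
  pearson_clean     = cor(lot_area, price),
  pearson_outliers  = cor(lot_area_out, price_out),
  spearman_outliers = cor(lot_area_out, price_out, method = "spearman")
)
round(correlations, 3)
```

In this setup five points out of 505 are enough to pull the Pearson coefficient far below its value on the clean data, while the Spearman coefficient, which depends only on ranks, changes much less.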

To assess multicollinearity among these correlated variables, we computed variance inflation factors (VIF) from a linear model of SalePrice on the selected predictors. GrLivArea (5.31), GarageCars (5.38), and GarageArea (5.19) exhibit a borderline level of collinearity, with VIFs of around 5; all other predictors have VIFs below 4.
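For a model with only numeric predictors, the VIF of predictor j equals 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing that predictor on all the others. The sketch below computes this by hand in base R on simulated data; the variable names and the strength of the garage relationship are assumptions made for illustration, not values from the Ames data.

```r
# Illustrative only: simulated predictors, two of which are strongly related
set.seed(123)
n <- 400
living_area <- rnorm(n, mean = 1500, sd = 500)
garage_area <- rnorm(n, mean = 470, sd = 200)
garage_cars <- garage_area / 250 + rnorm(n, sd = 0.4)  # tied to garage_area
lot_area    <- rnorm(n, mean = 10000, sd = 3000)       # unrelated to the rest

predictors <- data.frame(living_area, garage_area, garage_cars, lot_area)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all of the other predictors
manual_vif <- sapply(names(predictors), function(name) {
  others <- setdiff(names(predictors), name)
  fit <- lm(reformulate(others, response = name), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})
round(manual_vif, 2)
```

The two garage variables share most of their variance and so receive elevated VIFs, while the unrelated predictors stay close to 1. This is the same pattern seen for GarageCars and GarageArea in the output above.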

These correlations and collinearity issues should be kept in mind when modeling SalePrice. Further analysis and modeling techniques may be necessary to address the quadratic relationships and the collinear predictors in order to build an accurate predictive model.
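One way to check whether a curved relationship matters is to compare a straight-line fit with a fit that adds a squared term. The sketch below uses simulated data with a hypothetical multiplicative price process, so the names `living_area` and `price` and all coefficients are assumptions. It also fits a model on log(price), shown here only as one possible alternative for this kind of curvature.

```r
# Illustrative only: simulated prices that curve upward with living area
set.seed(99)
n <- 500
living_area <- runif(n, min = 500, max = 4000)
price <- exp(11 + 0.0005 * living_area + rnorm(n, sd = 0.15))

fit_linear    <- lm(price ~ living_area)
fit_quadratic <- lm(price ~ living_area + I(living_area^2))
fit_log       <- lm(log(price) ~ living_area)

# Nested F-test: does the squared term improve on the straight line?
curvature_test <- anova(fit_linear, fit_quadratic)
curvature_test

# Adjusted R-squared of the two models that share the same response
adj_r2 <- c(
  linear    = summary(fit_linear)$adj.r.squared,
  quadratic = summary(fit_quadratic)$adj.r.squared
)
round(adj_r2, 3)

# On the log scale the relationship is linear by construction,
# so the slope recovers the simulated value of 0.0005
coef(fit_log)["living_area"]
```

The adjusted R-squared values are only comparable between the linear and quadratic fits, because the log model has a different response scale.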

Categorical Data Analysis II

In [41]:
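# Fit a linear model with categorical variables to validate our visual findings

# Use the dummyVars() function to convert categorical variables into dummy variables.
# ameseda_c already contains SalePrice, so it goes on the left-hand side of the
# formula to keep it out of the predictors returned by predict().
# Then use janitor::clean_names() to clean up the column names
dummy_model <- dummyVars(SalePrice ~ ., data = ameseda_c)
ames_dummy <- as.data.frame(predict(dummy_model, newdata = ameseda_c))
ames_dummy <- clean_names(ames_dummy)


options(max.print = 2000)


# Add the response back under a single clean name and regress it on the dummies only,
# so that SalePrice is not regressed on a copy of itself
ames_dummy %>%
  mutate(sale_price = ameseda_c$SalePrice) %>%
  lm(sale_price ~ ., data = .) %>%
  summary()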
Call:
lm(formula = ameseda_n$SalePrice ~ ., data = .)

Residuals:
       Min         1Q     Median         3Q        Max 
-1.725e-09 -5.060e-11  4.900e-12  5.470e-11  1.198e-08 

Coefficients: (50 not defined because of singularities)
                          Estimate Std. Error    t value Pr(>|t|)    
(Intercept)              1.850e-09  1.088e-09  1.700e+00 0.089365 .  
`ameseda_n$SalePrice`    1.000e+00  3.283e-16  3.046e+15  < 2e-16 ***
ms_zoning_c_all         -4.339e-11  1.543e-10 -2.810e-01 0.778670    
ms_zoning_fv             2.621e-11  1.223e-10  2.140e-01 0.830385    
ms_zoning_rh            -1.298e-11  1.226e-10 -1.060e-01 0.915720    
ms_zoning_rl            -9.447e-12  6.043e-11 -1.560e-01 0.875789    
ms_zoning_rm                    NA         NA         NA       NA    
street_grvl             -3.317e-12  1.906e-10 -1.700e-02 0.986118    
street_pave                     NA         NA         NA       NA    
alley_grvl              -2.372e-11  9.685e-11 -2.450e-01 0.806554    
alley_none              -2.251e-11  7.612e-11 -2.960e-01 0.767470    
alley_pave                      NA         NA         NA       NA    
lot_shape_ir1           -4.272e-11  2.601e-11 -1.643e+00 0.100736    
lot_shape_ir2           -9.032e-11  6.930e-11 -1.303e+00 0.192668    
lot_shape_ir3           -4.377e-11  1.399e-10 -3.130e-01 0.754519    
lot_shape_reg                   NA         NA         NA       NA    
land_contour_bnk        -6.947e-13  5.971e-11 -1.200e-02 0.990719    
land_contour_hls        -2.948e-11  6.525e-11 -4.520e-01 0.651514    
land_contour_low        -6.675e-11  8.968e-11 -7.440e-01 0.456865    
land_contour_lvl                NA         NA         NA       NA    
utilities_all_pub       -1.786e-10  4.115e-10 -4.340e-01 0.664328    
utilities_no_se_wa              NA         NA         NA       NA    
lot_config_corner        6.613e-12  2.844e-11  2.330e-01 0.816173    
lot_config_cul_d_sac    -6.085e-13  4.814e-11 -1.300e-02 0.989918    
lot_config_fr2           2.222e-10  6.069e-11  3.661e+00 0.000262 ***
lot_config_fr3           3.949e-11  2.028e-10  1.950e-01 0.845653    
lot_config_inside               NA         NA         NA       NA    
land_slope_gtl          -4.967e-11  1.521e-10 -3.270e-01 0.744063    
land_slope_mod          -8.144e-11  1.517e-10 -5.370e-01 0.591405    
land_slope_sev                  NA         NA         NA       NA    
neighborhood_blmngtn    -1.316e-09  1.664e-10 -7.907e+00 5.79e-15 ***
neighborhood_blueste    -1.477e-09  3.165e-10 -4.667e+00 3.39e-06 ***
neighborhood_br_dale    -1.283e-09  1.870e-10 -6.859e+00 1.09e-11 ***
neighborhood_brk_side   -1.407e-09  1.506e-10 -9.342e+00  < 2e-16 ***
neighborhood_clear_cr   -1.349e-09  1.525e-10 -8.847e+00  < 2e-16 ***
neighborhood_collg_cr   -1.394e-09  1.329e-10 -1.048e+01  < 2e-16 ***
neighborhood_crawfor    -1.344e-09  1.424e-10 -9.441e+00  < 2e-16 ***
neighborhood_edwards    -1.409e-09  1.376e-10 -1.024e+01  < 2e-16 ***
neighborhood_gilbert    -1.385e-09  1.386e-10 -9.993e+00  < 2e-16 ***
neighborhood_idotrr     -1.387e-09  1.693e-10 -8.194e+00 6.23e-16 ***
neighborhood_meadow_v   -1.347e-09  1.862e-10 -7.233e+00 8.24e-13 ***
neighborhood_mitchel    -1.390e-09  1.417e-10 -9.809e+00  < 2e-16 ***
neighborhood_n_ames     -1.424e-09  1.322e-10 -1.077e+01  < 2e-16 ***
neighborhood_no_ridge   -1.338e-09  1.444e-10 -9.263e+00  < 2e-16 ***
neighborhood_n_pk_vill  -1.302e-09  2.326e-10 -5.599e+00 2.66e-08 ***
neighborhood_nridg_ht   -1.345e-09  1.406e-10 -9.564e+00  < 2e-16 ***
neighborhood_nw_ames    -1.407e-09  1.356e-10 -1.038e+01  < 2e-16 ***
neighborhood_old_town   -1.388e-09  1.524e-10 -9.113e+00  < 2e-16 ***
neighborhood_sawyer     -1.401e-09  1.373e-10 -1.021e+01  < 2e-16 ***
neighborhood_sawyer_w   -1.348e-09  1.366e-10 -9.871e+00  < 2e-16 ***
neighborhood_somerst    -1.385e-09  1.575e-10 -8.791e+00  < 2e-16 ***
neighborhood_stone_br   -1.272e-09  1.533e-10 -8.299e+00 2.71e-16 ***
neighborhood_swisu      -1.399e-09  1.597e-10 -8.761e+00  < 2e-16 ***
neighborhood_timber     -1.391e-09  1.447e-10 -9.613e+00  < 2e-16 ***
neighborhood_veenker            NA         NA         NA       NA    
condition1_artery        6.544e-11  2.081e-10  3.140e-01 0.753205    
condition1_feedr         2.101e-10  2.012e-10  1.044e+00 0.296637    
condition1_norm          6.196e-11  1.964e-10  3.150e-01 0.752446    
condition1_pos_a         6.213e-11  2.442e-10  2.540e-01 0.799208    
condition1_pos_n         1.303e-10  2.188e-10  5.960e-01 0.551612    
condition1_rr_ae         4.059e-11  2.362e-10  1.720e-01 0.863587    
condition1_rr_an         5.550e-11  2.062e-10  2.690e-01 0.787868    
condition1_rr_ne         1.127e-10  3.367e-10  3.350e-01 0.737971    
condition1_rr_nn                NA         NA         NA       NA    
condition2_artery        1.607e-10  4.332e-10  3.710e-01 0.710750    
condition2_feedr         2.423e-10  3.403e-10  7.120e-01 0.476644    
condition2_norm          1.583e-10  2.877e-10  5.500e-01 0.582287    
condition2_pos_a         3.717e-10  5.900e-10  6.300e-01 0.528804    
condition2_pos_n         6.521e-11  4.153e-10  1.570e-01 0.875258    
condition2_rr_ae         1.349e-11  7.812e-10  1.700e-02 0.986227    
condition2_rr_an         6.096e-11  4.793e-10  1.270e-01 0.898819    
condition2_rr_nn                NA         NA         NA       NA    
bldg_type_1fam           1.025e-10  5.882e-11  1.742e+00 0.081682 .  
bldg_type_2fm_con        6.272e-11  1.017e-10  6.160e-01 0.537681    
bldg_type_duplex         9.062e-11  9.228e-11  9.820e-01 0.326247    
bldg_type_twnhs          6.505e-12  8.128e-11  8.000e-02 0.936231    
bldg_type_twnhs_e               NA         NA         NA       NA    
house_style_1_5fin       1.097e-10  7.102e-11  1.544e+00 0.122812    
house_style_1_5unf       8.286e-11  1.337e-10  6.200e-01 0.535420    
house_style_1story       1.289e-10  5.963e-11  2.161e+00 0.030898 *  
house_style_2_5fin       1.186e-10  1.698e-10  6.990e-01 0.484855    
house_style_2_5unf       6.740e-11  1.513e-10  4.450e-01 0.656171    
house_style_2story       1.030e-10  6.167e-11  1.670e+00 0.095200 .  
house_style_s_foyer      5.239e-11  8.696e-11  6.020e-01 0.546979    
house_style_s_lvl               NA         NA         NA       NA    
roof_style_flat         -1.234e-10  5.512e-10 -2.240e-01 0.822858    
roof_style_gable        -1.397e-11  4.725e-10 -3.000e-02 0.976424    
roof_style_gambrel      -2.528e-12  4.869e-10 -5.000e-03 0.995859    
roof_style_hip          -3.701e-11  4.727e-10 -7.800e-02 0.937606    
roof_style_mansard       8.283e-11  4.682e-10  1.770e-01 0.859597    
roof_style_shed                 NA         NA         NA       NA    
roof_matl_cly_tile      -2.269e-10  5.729e-10 -3.960e-01 0.692175    
roof_matl_comp_shg       2.432e-10  1.838e-10  1.323e+00 0.185935    
roof_matl_membran        2.664e-10  5.535e-10  4.810e-01 0.630445    
roof_matl_metal          3.106e-10  5.294e-10  5.870e-01 0.557604    
roof_matl_roll           1.357e-10  4.524e-10  3.000e-01 0.764319    
roof_matl_tar_grv        4.095e-10  3.425e-10  1.195e+00 0.232167    
roof_matl_wd_shake       2.492e-10  3.000e-10  8.310e-01 0.406373    
roof_matl_wd_shngl              NA         NA         NA       NA    
exterior1st_asb_shng     1.321e-10  2.141e-10  6.170e-01 0.537418    
exterior1st_asph_shn     1.447e-10  5.128e-10  2.820e-01 0.777812    
exterior1st_brk_comm     5.179e-11  4.169e-10  1.240e-01 0.901162    
exterior1st_brk_face     6.385e-11  1.253e-10  5.090e-01 0.610555    
exterior1st_c_block     -4.819e-12  4.329e-10 -1.100e-02 0.991120    
exterior1st_cemnt_bd     1.359e-11  2.516e-10  5.400e-02 0.956940    
exterior1st_hd_board     2.082e-11  1.157e-10  1.800e-01 0.857309    
exterior1st_im_stucc     2.368e-11  4.264e-10  5.600e-02 0.955727    
exterior1st_metal_sd     5.928e-12  1.600e-10  3.700e-02 0.970453    
exterior1st_plywood     -5.699e-11  1.161e-10 -4.910e-01 0.623485    
exterior1st_stone        1.023e-10  3.409e-10  3.000e-01 0.764215    
exterior1st_stucco       8.690e-11  1.573e-10  5.530e-01 0.580654    
exterior1st_vinyl_sd     9.400e-11  1.426e-10  6.590e-01 0.509769    
exterior1st_wd_sdng      3.748e-11  1.071e-10  3.500e-01 0.726564    
exterior1st_wd_shing            NA         NA         NA       NA    
exterior2nd_asb_shng    -5.986e-11  2.007e-10 -2.980e-01 0.765560    
exterior2nd_asph_shn    -6.496e-11  3.132e-10 -2.070e-01 0.835744    
exterior2nd_brk_cmn     -3.677e-11  2.775e-10 -1.330e-01 0.894608    
exterior2nd_brk_face    -1.284e-10  1.373e-10 -9.350e-01 0.350029    
exterior2nd_c_block             NA         NA         NA       NA    
exterior2nd_cment_bd     3.845e-11  2.442e-10  1.570e-01 0.874911    
exterior2nd_hd_board     1.383e-11  1.084e-10  1.280e-01 0.898455    
exterior2nd_im_stucc     4.810e-12  1.631e-10  3.000e-02 0.976470    
exterior2nd_metal_sd     1.359e-10  1.542e-10  8.810e-01 0.378340    
exterior2nd_other       -2.444e-10  4.106e-10 -5.950e-01 0.551744    
exterior2nd_plywood     -2.470e-11  1.030e-10 -2.400e-01 0.810514    
exterior2nd_stone       -3.131e-11  2.167e-10 -1.440e-01 0.885147    
exterior2nd_stucco       1.304e-11  1.493e-10  8.700e-02 0.930394    
exterior2nd_vinyl_sd    -4.176e-11  1.268e-10 -3.290e-01 0.741953    
exterior2nd_wd_sdng      1.550e-11  9.321e-11  1.660e-01 0.867991    
exterior2nd_wd_shng             NA         NA         NA       NA    
mas_vnr_type_brk_cmn    -2.031e-11  1.163e-10 -1.750e-01 0.861403    
mas_vnr_type_brk_face   -1.044e-11  4.595e-11 -2.270e-01 0.820356    
mas_vnr_type_none       -2.179e-12  4.747e-11 -4.600e-02 0.963388    
mas_vnr_type_stone              NA         NA         NA       NA    
exter_qual_ex           -3.434e-11  8.547e-11 -4.020e-01 0.687945    
exter_qual_fa            1.554e-11  1.529e-10  1.020e-01 0.919036    
exter_qual_gd           -7.770e-11  3.926e-11 -1.979e+00 0.048027 *  
exter_qual_ta                   NA         NA         NA       NA    
exter_cond_ex           -3.469e-11  2.765e-10 -1.250e-01 0.900207    
exter_cond_fa           -2.707e-11  9.296e-11 -2.910e-01 0.770945    
exter_cond_gd           -5.852e-11  3.778e-11 -1.549e+00 0.121601    
exter_cond_po            7.901e-11  4.224e-10  1.870e-01 0.851667    
exter_cond_ta                   NA         NA         NA       NA    
foundation_brk_til      -5.724e-12  2.344e-10 -2.400e-02 0.980520    
foundation_c_block       2.799e-11  2.314e-10  1.210e-01 0.903760    
foundation_p_conc       -3.838e-11  2.300e-10 -1.670e-01 0.867498    
foundation_slab          1.128e-11  2.772e-10  4.100e-02 0.967558    
foundation_stone         2.348e-11  2.893e-10  8.100e-02 0.935319    
foundation_wood                 NA         NA         NA       NA    
bsmt_qual_ex             3.303e-11  6.540e-11  5.050e-01 0.613580    
bsmt_qual_fa            -1.630e-11  7.812e-11 -2.090e-01 0.834730    
bsmt_qual_gd             5.814e-11  3.957e-11  1.469e+00 0.142013    
bsmt_qual_none          -4.905e-11  5.484e-10 -8.900e-02 0.928756    
bsmt_qual_ta                    NA         NA         NA       NA    
bsmt_cond_fa            -4.492e-12  6.786e-11 -6.600e-02 0.947236    
bsmt_cond_gd             2.126e-11  5.195e-11  4.090e-01 0.682375    
bsmt_cond_none                  NA         NA         NA       NA    
bsmt_cond_po            -1.228e-10  4.705e-10 -2.610e-01 0.794076    
bsmt_cond_ta                    NA         NA         NA       NA    
bsmt_exposure_av         2.244e-11  3.761e-10  6.000e-02 0.952426    
bsmt_exposure_gd         1.283e-10  3.780e-10  3.400e-01 0.734264    
bsmt_exposure_mn         1.968e-12  3.771e-10  5.000e-03 0.995837    
bsmt_exposure_no        -2.369e-11  3.753e-10 -6.300e-02 0.949691    
bsmt_exposure_none              NA         NA         NA       NA    
bsmt_fin_type1_alq       5.641e-11  3.874e-11  1.456e+00 0.145583    
bsmt_fin_type1_blq      -3.629e-12  4.458e-11 -8.100e-02 0.935126    
bsmt_fin_type1_glq      -4.243e-12  3.262e-11 -1.300e-01 0.896518    
bsmt_fin_type1_lw_q     -1.804e-11  5.601e-11 -3.220e-01 0.747491    
bsmt_fin_type1_none             NA         NA         NA       NA    
bsmt_fin_type1_rec      -6.604e-12  4.577e-11 -1.440e-01 0.885293    
bsmt_fin_type1_unf              NA         NA         NA       NA    
bsmt_fin_type2_alq      -7.360e-11  9.835e-11 -7.480e-01 0.454387    
bsmt_fin_type2_blq      -1.253e-11  7.170e-11 -1.750e-01 0.861244    
bsmt_fin_type2_glq      -9.330e-11  1.219e-10 -7.660e-01 0.444033    
bsmt_fin_type2_lw_q     -2.092e-11  6.253e-11 -3.350e-01 0.738042    
bsmt_fin_type2_none      3.171e-11  3.784e-10  8.400e-02 0.933231    
bsmt_fin_type2_rec      -2.491e-11  6.029e-11 -4.130e-01 0.679564    
bsmt_fin_type2_unf              NA         NA         NA       NA    
heating_floor           -1.697e-10  4.837e-10 -3.510e-01 0.725826    
heating_gas_a           -3.301e-11  2.331e-10 -1.420e-01 0.887394    
heating_gas_w           -4.190e-11  2.488e-10 -1.680e-01 0.866281    
heating_grav            -5.942e-11  2.848e-10 -2.090e-01 0.834746    
heating_oth_w           -6.593e-11  3.742e-10 -1.760e-01 0.860160    
heating_wall                    NA         NA         NA       NA    
heating_qc_ex            5.609e-11  3.325e-11  1.687e+00 0.091905 .  
heating_qc_fa            2.061e-11  7.280e-11  2.830e-01 0.777131    
heating_qc_gd            1.467e-11  3.478e-11  4.220e-01 0.673200    
heating_qc_po           -1.180e-10  4.289e-10 -2.750e-01 0.783299    
heating_qc_ta                   NA         NA         NA       NA    
central_air_n            2.728e-11  6.183e-11  4.410e-01 0.659069    
central_air_y                   NA         NA         NA       NA    
electrical_fuse_a       -1.437e-11  4.713e-11 -3.050e-01 0.760439    
electrical_fuse_f       -8.448e-12  8.895e-11 -9.500e-02 0.924351    
electrical_fuse_p        6.548e-11  2.976e-10  2.200e-01 0.825869    
electrical_mix           6.625e-11  7.170e-10  9.200e-02 0.926394    
electrical_s_brkr               NA         NA         NA       NA    
kitchen_qual_ex         -2.163e-11  6.362e-11 -3.400e-01 0.734001    
kitchen_qual_fa         -1.073e-11  7.895e-11 -1.360e-01 0.891874    
kitchen_qual_gd         -1.759e-11  3.298e-11 -5.340e-01 0.593749    
kitchen_qual_ta                 NA         NA         NA       NA    
functional_maj1          8.874e-11  1.166e-10  7.610e-01 0.446707    
functional_maj2         -2.987e-12  1.990e-10 -1.500e-02 0.988024    
functional_min1         -1.940e-11  7.511e-11 -2.580e-01 0.796263    
functional_min2         -9.135e-12  7.236e-11 -1.260e-01 0.899566    
functional_mod           3.934e-11  1.210e-10  3.250e-01 0.745139    
functional_sev           2.664e-10  4.629e-10  5.760e-01 0.565034    
functional_typ                  NA         NA         NA       NA    
fireplace_qu_ex         -3.732e-12  8.915e-11 -4.200e-02 0.966611    
fireplace_qu_fa         -8.307e-11  7.470e-11 -1.112e+00 0.266367    
fireplace_qu_gd         -5.794e-11  3.562e-11 -1.626e+00 0.104133    
fireplace_qu_none       -5.938e-11  3.409e-11 -1.742e+00 0.081807 .  
fireplace_qu_po         -8.932e-11  9.638e-11 -9.270e-01 0.354213    
fireplace_qu_ta                 NA         NA         NA       NA    
garage_type_2types      -3.122e-11  1.793e-10 -1.740e-01 0.861809    
garage_type_attchd       3.126e-11  6.017e-11  5.200e-01 0.603503    
garage_type_basment      2.547e-11  1.143e-10  2.230e-01 0.823633    
garage_type_built_in     5.996e-11  7.697e-11  7.790e-01 0.436070    
garage_type_car_port     1.972e-11  1.635e-10  1.210e-01 0.904035    
garage_type_detchd       2.039e-11  5.647e-11  3.610e-01 0.718037    
garage_type_none                NA         NA         NA       NA    
garage_finish_fin        1.373e-11  3.860e-11  3.560e-01 0.722227    
garage_finish_none              NA         NA         NA       NA    
garage_finish_r_fn       4.864e-11  3.428e-11  1.419e+00 0.156167    
garage_finish_unf               NA         NA         NA       NA    
garage_qual_ex           4.720e-10  4.709e-10  1.002e+00 0.316358    
garage_qual_fa          -1.724e-11  7.646e-11 -2.250e-01 0.821677    
garage_qual_gd          -1.803e-12  1.226e-10 -1.500e-02 0.988271    
garage_qual_none                NA         NA         NA       NA    
garage_qual_po          -6.771e-11  3.866e-10 -1.750e-01 0.860997    
garage_qual_ta                  NA         NA         NA       NA    
garage_cond_ex          -5.060e-10  5.461e-10 -9.270e-01 0.354315    
garage_cond_fa           5.893e-12  8.769e-11  6.700e-02 0.946435    
garage_cond_gd           2.317e-11  1.481e-10  1.560e-01 0.875708    
garage_cond_none                NA         NA         NA       NA    
garage_cond_po           6.234e-11  2.219e-10  2.810e-01 0.778789    
garage_cond_ta                  NA         NA         NA       NA    
paved_drive_n           -9.486e-12  5.481e-11 -1.730e-01 0.862643    
paved_drive_p            2.423e-11  7.868e-11  3.080e-01 0.758140    
paved_drive_y                   NA         NA         NA       NA    
pool_qc_ex               2.100e-10  3.049e-10  6.890e-01 0.491122    
pool_qc_fa              -1.596e-11  3.995e-10 -4.000e-02 0.968142    
pool_qc_gd               2.270e-10  3.087e-10  7.360e-01 0.462160    
pool_qc_none                    NA         NA         NA       NA    
fence_gd_prv             1.886e-11  5.916e-11  3.190e-01 0.749966    
fence_gd_wo             -1.718e-11  5.756e-11 -2.990e-01 0.765345    
fence_mn_prv            -1.154e-11  3.673e-11 -3.140e-01 0.753495    
fence_mn_ww              1.213e-11  1.205e-10  1.010e-01 0.919800    
fence_none                      NA         NA         NA       NA    
misc_feature_gar2       -1.646e-10  6.727e-10 -2.450e-01 0.806675    
misc_feature_none       -1.565e-10  5.548e-10 -2.820e-01 0.777853    
misc_feature_othr       -1.151e-10  6.331e-10 -1.820e-01 0.855793    
misc_feature_shed       -2.092e-10  5.591e-10 -3.740e-01 0.708278    
misc_feature_ten_c              NA         NA         NA       NA    
sale_type_cod            1.102e-11  6.734e-11  1.640e-01 0.870099    
sale_type_con           -6.292e-10  2.776e-10 -2.267e+00 0.023583 *  
sale_type_con_ld        -3.498e-11  1.442e-10 -2.430e-01 0.808303    
sale_type_con_li        -3.441e-11  1.762e-10 -1.950e-01 0.845216    
sale_type_con_lw         1.441e-11  1.840e-10  7.800e-02 0.937592    
sale_type_cwd            3.045e-11  1.972e-10  1.540e-01 0.877291    
sale_type_new            2.398e-10  2.406e-10  9.970e-01 0.319075    
sale_type_oth           -4.789e-11  2.281e-10 -2.100e-01 0.833751    
sale_type_wd                    NA         NA         NA       NA    
sale_condition_abnorml   2.246e-10  2.397e-10  9.370e-01 0.348864    
sale_condition_adj_land  1.300e-10  3.240e-10  4.010e-01 0.688232    
sale_condition_alloca    2.355e-10  2.721e-10  8.650e-01 0.386982    
sale_condition_family    2.426e-10  2.526e-10  9.600e-01 0.337159    
sale_condition_normal    2.262e-10  2.373e-10  9.530e-01 0.340707    
sale_condition_partial          NA         NA         NA       NA    
sale_price                      NA         NA         NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.712e-10 on 1241 degrees of freedom
Multiple R-squared:      1,	Adjusted R-squared:      1 
F-statistic: 3.065e+29 on 218 and 1241 DF,  p-value: < 2.2e-16

After converting the categorical variables into dummy variables and fitting a linear model, every neighborhood dummy is flagged as highly significant, which points to neighborhood as an important factor in determining sale price. This reading needs a caveat: the fit has an R-squared of 1, a coefficient of exactly 1 on `ameseda_n$SalePrice`, and residuals no larger than about 1e-08, so the response itself is among the predictors and the remaining estimates (on the order of 1e-09 or smaller) sit at the scale of floating-point error. The significance codes are therefore suggestive at best.

Several other dummy variables are also flagged: lot_config_fr2, house_style_1story, exter_qual_gd, and sale_type_con at the 0.05 level, and fireplace_qu_none at the 0.1 level. We carry these forward as candidate predictors.

In a more complex model, the effect of "Neighborhood" may also depend on its interaction with other variables, so those interactions would be worth exploring and testing.
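
The fit above reports an R-squared of exactly 1 and a coefficient of 1 on `ameseda_n$SalePrice`, which is the pattern produced when a copy of the response sits among the predictors. The sketch below reproduces that effect on synthetic data and contrasts it with a fit that keeps the response out of the design matrix. The data frame `toy`, its columns, and the three neighborhood labels are invented for illustration and are not taken from the Ames data.

```r
# Synthetic illustration only: `toy` and all of its columns are hypothetical.
set.seed(1)
n <- 300
toy <- data.frame(
  neighborhood = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  lot_shape    = factor(sample(c("reg", "ir1"), n, replace = TRUE))
)

# Price depends on neighborhood only, plus noise
base_price <- c(a = 120000, b = 180000, c = 250000)
toy$sale_price <- unname(base_price[as.character(toy$neighborhood)]) +
  rnorm(n, sd = 20000)

# Leaky fit: a copy of the response is included as a predictor
leaky <- lm(sale_price ~ price_copy + neighborhood + lot_shape,
            data = transform(toy, price_copy = sale_price))
leaky_r2 <- suppressWarnings(summary(leaky)$r.squared)

# Clean fit: categorical predictors only
clean <- lm(sale_price ~ neighborhood + lot_shape, data = toy)
clean_r2 <- summary(clean)$r.squared

# One sequential F-test per factor instead of one t-test per dummy level
factor_tests <- anova(clean)

print(c(leaky = leaky_r2, clean = clean_r2))
print(factor_tests)
```

Testing each factor as a whole with `anova()` also avoids reading significance off individual dummy levels, whose t-tests depend on which level is the reference.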

Transformations

The following chunk adds transformed versions of several attributes (logs and squares) as new columns. These transformations are analyzed in the sections that follow.

In [42]:
# Log transforms of the response and of living area
ames$log_sale_price <- log(ames$sale_price)
ames$log_gr_liv_area <- log(ames$gr_liv_area)

# Squared and log terms for other candidate predictors
ames$overall_qual_2 <- ames$overall_qual^2
ames$lot_area_2 <- ames$lot_area^2
ames$log_lot_area <- log(ames$lot_area)
# ames$year_built_t = plogis(ames_non_dummy$year_built-1940)
# Note: log() returns -Inf wherever the area is 0 (e.g. no basement or no garage)
ames$log_total_bsmt_sf <- log(ames$total_bsmt_sf)
ames$log_garage_area <- log(ames$garage_area)
ames$log_x1st_flr_sf <- log(ames$x1st_flr_sf)
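
Area columns in housing data can legitimately be zero (for example, a house with no garage), and `log(0)` evaluates to `-Inf` in R. A minimal sketch of one common workaround, `log1p()`, on a made-up vector follows. The values are hypothetical, and whether this shift suits the analysis is a modelling choice the notebook does not make.

```r
# Hypothetical garage areas in square feet; 0 means no garage
garage_area <- c(0, 240, 480, 576)

log_plain <- log(garage_area)    # first element is -Inf
log_shift <- log1p(garage_area)  # log(1 + x), finite at zero

print(is.finite(log_plain))
print(round(log_shift, 3))
```

Non-finite values in a predictor make `lm()` stop with an error, so checking `is.finite()` on transformed columns before modelling is a cheap safeguard.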

Sale Price Vs Gross Living Area by Neighborhood

In [43]:
# Convert the neighborhood dummy columns from wide format to long format
ames_long <- ames %>%
  pivot_longer(
    cols = starts_with("neighborhood_"),
    names_to = "Neighborhood",
    values_to = "value"
  ) %>%
  filter(value == 1) %>%  # Keep only rows where the neighborhood dummy variable is 1
  select(-value)  # Remove the 'value' column as it's no longer needed

# Plot Sale Price vs. Gross Living Area colored by neighborhood, omitting rows where SalePrice is NA
ames_long %>%
  filter(!is.na(sale_price)) %>%
  ggplot(aes(x = gr_liv_area, y = sale_price, color = Neighborhood)) +
  geom_point(show.legend = FALSE) +
  theme_gdocs() +
  labs(title = "Sale Price vs. Gross Living Area by Neighborhood", x = "Gross Living Area", y = "Sale Price")
[Figure: Sale Price vs. Gross Living Area by Neighborhood (x: Gross Living Area, y: Sale Price)]

Sale Price increases with Gross Living Area, and the neighborhoods show distinct relationships with Sale Price. This is consistent with the importance of the Neighborhood variable suggested by the previous linear model.

Sale Price Vs Gross Living Area (Untransformed Vs Transformed)

In [44]:
# Untransformed variables
# (par(mfrow = ...) only affects base graphics, not ggplot2, so each plot renders on its own)
ames %>%
  ggplot(aes(x = gr_liv_area, y = sale_price)) +
  geom_point() +
  geom_smooth() +
  theme_gdocs() +
  labs(
    title = "Sale Price vs. Gross Living Area",
    x = "Gross Living Area",
    y = "Sale Price"
  )

# Log transformed
ames %>%
  ggplot(aes(x = log_gr_liv_area, y = log_sale_price)) +
  geom_point() +
  geom_smooth() +
  theme_gdocs() +
  labs(
    title = "Log(Sale Price) vs. Log(Gross Living Area)",
    x = "Log(Gross Living Area)",
    y = "Log(Sale Price)"
  )
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning message:
“Removed 1459 rows containing non-finite values (`stat_smooth()`).”
Warning message:
“Removed 1459 rows containing missing values (`geom_point()`).”
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning message:
“Removed 1459 rows containing non-finite values (`stat_smooth()`).”
Warning message:
“Removed 1459 rows containing missing values (`geom_point()`).”
[Figure: Sale Price vs. Gross Living Area (x: Gross Living Area, y: Sale Price)]
[Figure: Log(Sale Price) vs. Log(Gross Living Area) (x: Log(Gross Living Area), y: Log(Sale Price))]

The log-transformed Sale Price and Gross Living Area exhibit a more linear relationship than the raw variables, so a linear model on the transformed scale is likely to fit better. The trade-off is interpretability: coefficients then describe changes in log price rather than in dollars, which is less intuitive to read directly.
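
One way to recover some interpretability is to read the slope of a log-log regression as an elasticity: if log(y) = b0 + b1 log(x), then a 1% change in x corresponds to roughly a b1% change in y. The sketch below checks this on simulated data. The true elasticity of 0.8, the variable names, and the noise level are all assumptions chosen for illustration, not estimates from the Ames data.

```r
# Simulated data only: the elasticity of 0.8 is an assumed value.
set.seed(42)
n <- 500
area  <- runif(n, min = 600, max = 4000)            # hypothetical living area
price <- 500 * area^0.8 * exp(rnorm(n, sd = 0.1))   # multiplicative noise

# Log-log fit: the slope estimates the elasticity of price with respect to area
fit <- lm(log(price) ~ log(area))
elasticity <- unname(coef(fit)["log(area)"])

# A 10% larger area multiplies the predicted price by 1.10^elasticity
price_ratio <- 1.10^elasticity

print(round(elasticity, 3))
print(round(price_ratio, 3))
```

Reporting effects as "a 10% larger house sells for about X% more" keeps a log-log model explainable even though its raw coefficients are not in dollars.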

EDA Conclusion

During our exploratory data analysis (EDA), we undertook several important steps. Firstly, we cleaned the data by addressing any inconsistencies or missing values. Next, we examined both numerical and categorical variables, exploring their distributions and identifying potential patterns or trends.

We also assessed the relationships between the explanatory variables and the response variable using visual and mathematical techniques. These analyses provided valuable insights into the dependencies and correlations within the dataset.

By conducting this comprehensive EDA, we have equipped ourselves with the necessary foundation for achieving our two primary objectives: developing an interpretable regression model and constructing a more complex predictive model. The insights gained from our EDA will guide us in selecting meaningful features and formulating effective modeling strategies.

Objective 1: Interpretable Regression Model

For this model, we will fit a linear regression with the variables that we have identified as significant. Because the focus of this model is interpretability, we will not include any interaction terms, polynomials, or other transformations.

Based on the exploratory analysis above, we will include the following variables in the regression model:

- gr_liv_area
- lot_area
- overall_qual
- year_built
- year_remod_add
- total_bsmt_sf
- garage_area
- garage_cars
- tot_rms_abv_grd
- all neighborhood dummy variables (with neighborhood_veenker as the reference level)
- lot_config_fr2
- house_style1story
- exter_qual_gd
- fireplace_qu_none
- sale_type_con

In [45]:
# Split the data into training and testing sets
train <- ames %>%
  filter(train == 1) %>%
  select(-train)
test <- ames %>%
  filter(train == 0) %>%
  select(-train)

# Train a linear regression model with caret using CV

predictor_vars <- c(
  "gr_liv_area", "lot_area", "overall_qual", "year_built", "year_remod_add",
  "total_bsmt_sf", "garage_area", "garage_cars", "tot_rms_abv_grd", # "x1st_flr_sf", "x2nd_flr_sf" removed for VIF
  "lot_config_fr2", "house_style1story", "exter_qual_gd", "fireplace_qu_none", "sale_type_con"
) %>% paste(collapse = " + ")
neighborhood_vars <- grep("neighborhood", colnames(train), value = TRUE) %>% paste(collapse = " + ")
terms <- paste(predictor_vars, neighborhood_vars, sep = " + ")
# neighborhood_veenker is dropped so that it serves as the reference level
formula <- as.formula(paste("sale_price ~", terms, "- neighborhood_veenker"))

set.seed(137)
ctrl <- trainControl(method = "cv", number = 10, verboseIter = TRUE)
lmFit <- train(formula, data = train, method = "lm", trControl = ctrl, metric = "RMSE")
summary(lmFit)
confint(lmFit$finalModel)

library(car)
vif(lmFit$finalModel)

# Plot the RMSE for each fold
lmFit$resample %>%
  ggplot(aes(x = (1:10), y = RMSE)) +
  geom_point() +
  geom_line() +
  theme_gdocs() +
  labs(title = "RMSE for each fold", x = "Fold", y = "RMSE")
+ Fold01: intercept=TRUE 
- Fold01: intercept=TRUE 
+ Fold02: intercept=TRUE 
- Fold02: intercept=TRUE 
+ Fold03: intercept=TRUE 
- Fold03: intercept=TRUE 
+ Fold04: intercept=TRUE 
- Fold04: intercept=TRUE 
+ Fold05: intercept=TRUE 
- Fold05: intercept=TRUE 
+ Fold06: intercept=TRUE 
- Fold06: intercept=TRUE 
+ Fold07: intercept=TRUE 
- Fold07: intercept=TRUE 
+ Fold08: intercept=TRUE 
- Fold08: intercept=TRUE 
+ Fold09: intercept=TRUE 
- Fold09: intercept=TRUE 
+ Fold10: intercept=TRUE 
- Fold10: intercept=TRUE 
Aggregating results
Fitting final model on full training set
Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-133463  -14543     267   14083  229636 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)            -1.477e+06  1.462e+05 -10.098  < 2e-16 ***
gr_liv_area             6.197e+01  3.872e+00  16.006  < 2e-16 ***
lot_area                7.219e-01  9.298e-02   7.764 1.56e-14 ***
overall_qual            1.489e+04  1.074e+03  13.864  < 2e-16 ***
year_built              3.766e+02  6.327e+01   5.952 3.34e-09 ***
year_remod_add          3.633e+02  5.383e+01   6.749 2.17e-11 ***
total_bsmt_sf           2.940e+01  3.083e+00   9.537  < 2e-16 ***
garage_area             2.915e+01  8.774e+00   3.322 0.000915 ***
garage_cars             2.010e+03  2.601e+03   0.773 0.439841    
tot_rms_abv_grd        -1.963e+03  9.365e+02  -2.096 0.036268 *  
lot_config_fr2         -6.850e+03  4.603e+03  -1.488 0.136901    
house_style1story       4.185e+03  2.376e+03   1.762 0.078351 .  
exter_qual_gd          -1.492e+04  2.452e+03  -6.084 1.50e-09 ***
fireplace_qu_none      -4.561e+03  1.989e+03  -2.293 0.021969 *  
sale_type_con           2.274e+04  2.208e+04   1.030 0.303223    
neighborhood_blmngtn   -4.480e+04  1.218e+04  -3.677 0.000245 ***
neighborhood_blueste   -5.290e+04  2.363e+04  -2.239 0.025297 *  
neighborhood_br_dale   -4.722e+04  1.254e+04  -3.767 0.000172 ***
neighborhood_brk_side  -1.630e+04  1.081e+04  -1.508 0.131774    
neighborhood_clear_cr  -2.934e+04  1.126e+04  -2.605 0.009278 ** 
neighborhood_collg_cr  -3.181e+04  9.895e+03  -3.215 0.001333 ** 
neighborhood_crawfor   -4.416e+03  1.068e+04  -0.413 0.679304    
neighborhood_edwards   -2.995e+04  1.027e+04  -2.916 0.003606 ** 
neighborhood_gilbert   -3.742e+04  1.027e+04  -3.644 0.000278 ***
neighborhood_idotrr    -2.954e+04  1.133e+04  -2.608 0.009190 ** 
neighborhood_meadow_v  -3.609e+04  1.240e+04  -2.909 0.003682 ** 
neighborhood_mitchel   -4.235e+04  1.055e+04  -4.013 6.31e-05 ***
neighborhood_n_ames    -3.303e+04  9.901e+03  -3.336 0.000873 ***
neighborhood_no_ridge   1.206e+04  1.076e+04   1.121 0.262531    
neighborhood_n_pk_vill -4.532e+04  1.401e+04  -3.234 0.001247 ** 
neighborhood_nridg_ht   7.301e+03  1.029e+04   0.709 0.478178    
neighborhood_nw_ames   -4.601e+04  1.023e+04  -4.499 7.38e-06 ***
neighborhood_old_town  -3.666e+04  1.059e+04  -3.461 0.000555 ***
neighborhood_sawyer    -3.350e+04  1.034e+04  -3.241 0.001218 ** 
neighborhood_sawyer_w  -3.583e+04  1.035e+04  -3.461 0.000555 ***
neighborhood_somerst   -2.540e+04  1.013e+04  -2.506 0.012313 *  
neighborhood_stone_br   1.859e+04  1.136e+04   1.637 0.101919    
neighborhood_swisu     -3.692e+04  1.187e+04  -3.109 0.001914 ** 
neighborhood_timber    -2.916e+04  1.076e+04  -2.710 0.006805 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 30340 on 1419 degrees of freedom
Multiple R-squared:  0.8581,	Adjusted R-squared:  0.8543 
F-statistic: 225.8 on 38 and 1419 DF,  p-value: < 2.2e-16
A matrix: 39 × 2 of type dbl
                                2.5 %         97.5 %
(Intercept)             -1.763431e+06  -1.189758e+06
gr_liv_area              5.437895e+01   6.956926e+01
lot_area                 5.395554e-01   9.043484e-01
overall_qual             1.278191e+04   1.699522e+04
year_built               2.524593e+02   5.006983e+02
year_remod_add           2.576682e+02   4.688461e+02
total_bsmt_sf            2.335389e+01   3.544940e+01
garage_area              1.193809e+01   4.635941e+01
garage_cars             -3.092948e+03   7.113031e+03
tot_rms_abv_grd         -3.800020e+03  -1.257428e+02
lot_config_fr2          -1.587970e+04   2.178861e+03
house_style1story       -4.752613e+02   8.846175e+03
exter_qual_gd           -1.973187e+04  -1.011022e+04
fireplace_qu_none       -8.462397e+03  -6.598585e+02
sale_type_con           -2.057103e+04   6.604946e+04
neighborhood_blmngtn    -6.870347e+04  -2.090313e+04
neighborhood_blueste    -9.924745e+04  -6.557713e+03
neighborhood_br_dale    -7.180779e+04  -2.262764e+04
neighborhood_brk_side   -3.749566e+04   4.902204e+03
neighborhood_clear_cr   -5.142601e+04  -7.247184e+03
neighborhood_collg_cr   -5.122557e+04  -1.240342e+04
neighborhood_crawfor    -2.536286e+04   1.653178e+04
neighborhood_edwards    -5.010621e+04  -9.800132e+03
neighborhood_gilbert    -5.756889e+04  -1.727781e+04
neighborhood_idotrr     -5.176178e+04  -7.326374e+03
neighborhood_meadow_v   -6.042013e+04  -1.175223e+04
neighborhood_mitchel    -6.305431e+04  -2.164816e+04
neighborhood_n_ames     -5.244687e+04  -1.360404e+04
neighborhood_no_ridge   -9.046799e+03   3.316849e+04
neighborhood_n_pk_vill  -7.281350e+04  -1.783595e+04
neighborhood_nridg_ht   -1.288705e+04   2.748901e+04
neighborhood_nw_ames    -6.607272e+04  -2.595021e+04
neighborhood_old_town   -5.743547e+04  -1.587850e+04
neighborhood_sawyer     -5.377424e+04  -1.322397e+04
neighborhood_sawyer_w   -5.614498e+04  -1.552221e+04
neighborhood_somerst    -4.527509e+04  -5.518848e+03
neighborhood_stone_br   -3.690534e+03   4.086611e+04
neighborhood_swisu      -6.021113e+04  -1.362483e+04
neighborhood_timber     -5.025784e+04  -8.052844e+03
term                    VIF
gr_liv_area             6.11969596779814
lot_area                1.33000515946677
overall_qual            3.45775522115555
year_built              5.77631976937239
year_remod_add          1.9537417499232
total_bsmt_sf           2.5905061543776
garage_area             5.48761424547523
garage_cars             5.97790450731877
tot_rms_abv_grd         3.62442415580001
lot_config_fr2          1.04677009206856
house_style1story       2.23497852236574
exter_qual_gd           2.12105019203486
fireplace_qu_none       1.56150065862576
sale_type_con           1.05752986809444
neighborhood_blmngtn    2.70916191149184
neighborhood_blueste    1.2109178183108
neighborhood_br_dale    2.70100266329667
neighborhood_brk_side   7.06486582435775
neighborhood_clear_cr   3.78252961365224
neighborhood_collg_cr   14.3125822413063
neighborhood_crawfor    6.09594664621172
neighborhood_edwards    10.4801137038149
neighborhood_gilbert    8.55991313432559
neighborhood_idotrr     5.02474712699903
neighborhood_meadow_v   2.8083936439461
neighborhood_mitchel    5.72923615321761
neighborhood_n_ames     20.2594427028515
neighborhood_no_ridge   5.01133360275701
neighborhood_n_pk_vill  1.90783894628335
neighborhood_nridg_ht   8.39058585283584
neighborhood_nw_ames    7.87787061297265
neighborhood_old_town   12.7042370817657
neighborhood_sawyer     8.15108214729673
neighborhood_sawyer_w   6.59278330446856
neighborhood_somerst    9.02658161449665
neighborhood_stone_br   3.44247827334504
neighborhood_swisu      3.76324729504303
neighborhood_timber     4.65221482319534

Summary table of coefficients.

In [46]:
# Summary table of coefficients

# Tidy the model coefficients; keep the unrounded p-values for the significance test
tidy_raw <- broom::tidy(lmFit$finalModel)

# Round the numeric columns for display
tidy_fit <- tidy_raw %>%
  mutate(across(where(is.numeric), ~round(., 4)))

# Create a table with bolded rows for p-value < 0.05
coef_table <- tidy_fit %>%
  kable("html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE) %>%
  row_spec(which(tidy_raw$p.value < 0.05), bold = TRUE)

coef_table
<table class="table table-striped table-hover table-condensed table-responsive" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;font-weight: bold;"> (Intercept) </td>
   <td style="text-align:right;font-weight: bold;"> -1476594.2891 </td>
   <td style="text-align:right;font-weight: bold;"> 146222.8999 </td>
   <td style="text-align:right;font-weight: bold;"> -10.0982 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> gr_liv_area </td>
   <td style="text-align:right;font-weight: bold;"> 61.9741 </td>
   <td style="text-align:right;font-weight: bold;"> 3.8718 </td>
   <td style="text-align:right;font-weight: bold;"> 16.0064 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> lot_area </td>
   <td style="text-align:right;font-weight: bold;"> 0.7220 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0930 </td>
   <td style="text-align:right;font-weight: bold;"> 7.7644 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> overall_qual </td>
   <td style="text-align:right;font-weight: bold;"> 14888.5629 </td>
   <td style="text-align:right;font-weight: bold;"> 1073.9280 </td>
   <td style="text-align:right;font-weight: bold;"> 13.8637 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> year_built </td>
   <td style="text-align:right;font-weight: bold;"> 376.5788 </td>
   <td style="text-align:right;font-weight: bold;"> 63.2734 </td>
   <td style="text-align:right;font-weight: bold;"> 5.9516 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> year_remod_add </td>
   <td style="text-align:right;font-weight: bold;"> 363.2571 </td>
   <td style="text-align:right;font-weight: bold;"> 53.8269 </td>
   <td style="text-align:right;font-weight: bold;"> 6.7486 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> total_bsmt_sf </td>
   <td style="text-align:right;font-weight: bold;"> 29.4016 </td>
   <td style="text-align:right;font-weight: bold;"> 3.0830 </td>
   <td style="text-align:right;font-weight: bold;"> 9.5367 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> garage_area </td>
   <td style="text-align:right;font-weight: bold;"> 29.1488 </td>
   <td style="text-align:right;font-weight: bold;"> 8.7736 </td>
   <td style="text-align:right;font-weight: bold;"> 3.3223 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0009 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> garage_cars </td>
   <td style="text-align:right;"> 2010.0417 </td>
   <td style="text-align:right;"> 2601.3932 </td>
   <td style="text-align:right;"> 0.7727 </td>
   <td style="text-align:right;"> 0.4398 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> tot_rms_abv_grd </td>
   <td style="text-align:right;font-weight: bold;"> -1962.8816 </td>
   <td style="text-align:right;font-weight: bold;"> 936.5334 </td>
   <td style="text-align:right;font-weight: bold;"> -2.0959 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0363 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> lot_config_fr2 </td>
   <td style="text-align:right;"> -6850.4208 </td>
   <td style="text-align:right;"> 4602.9317 </td>
   <td style="text-align:right;"> -1.4883 </td>
   <td style="text-align:right;"> 0.1369 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> house_style1story </td>
   <td style="text-align:right;"> 4185.4567 </td>
   <td style="text-align:right;"> 2375.9327 </td>
   <td style="text-align:right;"> 1.7616 </td>
   <td style="text-align:right;"> 0.0784 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> exter_qual_gd </td>
   <td style="text-align:right;font-weight: bold;"> -14921.0417 </td>
   <td style="text-align:right;font-weight: bold;"> 2452.4548 </td>
   <td style="text-align:right;font-weight: bold;"> -6.0841 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> fireplace_qu_none </td>
   <td style="text-align:right;font-weight: bold;"> -4561.1277 </td>
   <td style="text-align:right;font-weight: bold;"> 1988.7823 </td>
   <td style="text-align:right;font-weight: bold;"> -2.2934 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0220 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sale_type_con </td>
   <td style="text-align:right;"> 22739.2140 </td>
   <td style="text-align:right;"> 22078.6211 </td>
   <td style="text-align:right;"> 1.0299 </td>
   <td style="text-align:right;"> 0.3032 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> neighborhood_blmngtn </td>
   <td style="text-align:right;font-weight: bold;"> -44803.3000 </td>
   <td style="text-align:right;font-weight: bold;"> 12183.7882 </td>
   <td style="text-align:right;font-weight: bold;"> -3.6773 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0002 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> neighborhood_blueste </td>
   <td style="text-align:right;font-weight: bold;"> -52902.5804 </td>
   <td style="text-align:right;font-weight: bold;"> 23625.6062 </td>
   <td style="text-align:right;font-weight: bold;"> -2.2392 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0253 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> neighborhood_br_dale </td>
   <td style="text-align:right;font-weight: bold;"> -47217.7155 </td>
   <td style="text-align:right;font-weight: bold;"> 12535.4866 </td>
   <td style="text-align:right;font-weight: bold;"> -3.7667 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0002 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> neighborhood_brk_side </td>
   <td style="text-align:right;"> -16296.7257 </td>
   <td style="text-align:right;"> 10806.7538 </td>
   <td style="text-align:right;"> -1.5080 </td>
   <td style="text-align:right;"> 0.1318 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> neighborhood_clear_cr </td>
   <td style="text-align:right;font-weight: bold;"> -29336.5949 </td>
   <td style="text-align:right;font-weight: bold;"> 11260.7015 </td>
   <td style="text-align:right;font-weight: bold;"> -2.6052 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0093 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> neighborhood_collg_cr </td>
   <td style="text-align:right;font-weight: bold;"> -31814.4945 </td>
   <td style="text-align:right;font-weight: bold;"> 9895.3420 </td>
   <td style="text-align:right;font-weight: bold;"> -3.2151 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0013 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> neighborhood_crawfor </td>
   <td style="text-align:right;"> -4415.5412 </td>
   <td style="text-align:right;"> 10678.4877 </td>
   <td style="text-align:right;"> -0.4135 </td>
   <td style="text-align:right;"> 0.6793 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> neighborhood_edwards </td>
   <td style="text-align:right;font-weight: bold;"> -29953.1689 </td>
   <td style="text-align:right;font-weight: bold;"> 10273.5802 </td>
   <td style="text-align:right;font-weight: bold;"> -2.9156 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0036 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> neighborhood_gilbert </td>
   <td style="text-align:right;font-weight: bold;"> -37423.3500 </td>
   <td style="text-align:right;font-weight: bold;"> 10269.7604 </td>
   <td style="text-align:right;font-weight: bold;"> -3.6440 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0003 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> neighborhood_idotrr </td>
   <td style="text-align:right;font-weight: bold;"> -29544.0795 </td>
   <td style="text-align:right;font-weight: bold;"> 11326.1032 </td>
   <td style="text-align:right;font-weight: bold;"> -2.6085 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0092 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> neighborhood_meadow_v </td>
   <td style="text-align:right;font-weight: bold;"> -36086.1806 </td>
   <td style="text-align:right;font-weight: bold;"> 12404.9167 </td>
   <td style="text-align:right;font-weight: bold;"> -2.9090 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0037 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> neighborhood_mitchel </td>
   <td style="text-align:right;font-weight: bold;"> -42351.2343 </td>
   <td style="text-align:right;font-weight: bold;"> 10553.9768 </td>
   <td style="text-align:right;font-weight: bold;"> -4.0128 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0001 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> neighborhood_n_ames </td>
   <td style="text-align:right;font-weight: bold;"> -33025.4550 </td>
   <td style="text-align:right;font-weight: bold;"> 9900.6170 </td>
   <td style="text-align:right;font-weight: bold;"> -3.3357 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0009 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> neighborhood_no_ridge </td>
   <td style="text-align:right;"> 12060.8469 </td>
   <td style="text-align:right;"> 10760.2191 </td>
   <td style="text-align:right;"> 1.1209 </td>
   <td style="text-align:right;"> 0.2625 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> neighborhood_n_pk_vill </td>
   <td style="text-align:right;font-weight: bold;"> -45324.7249 </td>
   <td style="text-align:right;font-weight: bold;"> 14013.1806 </td>
   <td style="text-align:right;font-weight: bold;"> -3.2344 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0012 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> neighborhood_nridg_ht </td>
   <td style="text-align:right;"> 7300.9826 </td>
   <td style="text-align:right;"> 10291.4188 </td>
   <td style="text-align:right;"> 0.7094 </td>
   <td style="text-align:right;"> 0.4782 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> neighborhood_nw_ames </td>
   <td style="text-align:right;font-weight: bold;"> -46011.4606 </td>
   <td style="text-align:right;font-weight: bold;"> 10226.7919 </td>
   <td style="text-align:right;font-weight: bold;"> -4.4991 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> neighborhood_old_town </td>
   <td style="text-align:right;font-weight: bold;"> -36656.9849 </td>
   <td style="text-align:right;font-weight: bold;"> 10592.4212 </td>
   <td style="text-align:right;font-weight: bold;"> -3.4607 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0006 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> neighborhood_sawyer </td>
   <td style="text-align:right;font-weight: bold;"> -33499.1080 </td>
   <td style="text-align:right;font-weight: bold;"> 10335.8225 </td>
   <td style="text-align:right;font-weight: bold;"> -3.2411 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0012 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> neighborhood_sawyer_w </td>
   <td style="text-align:right;font-weight: bold;"> -35833.5952 </td>
   <td style="text-align:right;font-weight: bold;"> 10354.3025 </td>
   <td style="text-align:right;font-weight: bold;"> -3.4607 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0006 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> neighborhood_somerst </td>
   <td style="text-align:right;font-weight: bold;"> -25396.9710 </td>
   <td style="text-align:right;font-weight: bold;"> 10133.4350 </td>
   <td style="text-align:right;font-weight: bold;"> -2.5063 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0123 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> neighborhood_stone_br </td>
   <td style="text-align:right;"> 18587.7891 </td>
   <td style="text-align:right;"> 11357.0050 </td>
   <td style="text-align:right;"> 1.6367 </td>
   <td style="text-align:right;"> 0.1019 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> neighborhood_swisu </td>
   <td style="text-align:right;font-weight: bold;"> -36917.9793 </td>
   <td style="text-align:right;font-weight: bold;"> 11874.3430 </td>
   <td style="text-align:right;font-weight: bold;"> -3.1091 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0019 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;"> neighborhood_timber </td>
   <td style="text-align:right;font-weight: bold;"> -29155.3435 </td>
   <td style="text-align:right;font-weight: bold;"> 10757.5958 </td>
   <td style="text-align:right;font-weight: bold;"> -2.7102 </td>
   <td style="text-align:right;font-weight: bold;"> 0.0068 </td>
  </tr>
</tbody>
</table>

Note that the reference Neighborhood is Veenker, so all neighborhood adjustments are relative to it.
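
To illustrate what a reference level means, here is a minimal sketch on a hypothetical toy data set. The prices are made up and only the neighborhood names come from this notebook. It assumes the neighborhood is stored as a factor, which may differ from how the dummy columns were actually built here.

```r
# Hypothetical toy data: two houses in each of three neighborhoods
toy <- data.frame(
  sale_price   = c(250000, 240000, 140000, 150000, 200000, 210000),
  neighborhood = factor(c("Veenker", "Veenker", "OldTown",
                          "OldTown", "Somerst", "Somerst"))
)

# Make Veenker the reference level, so it receives no dummy column
toy$neighborhood <- relevel(toy$neighborhood, ref = "Veenker")

fit <- lm(sale_price ~ neighborhood, data = toy)

# The intercept is the Veenker mean; each other coefficient is the
# difference between that neighborhood's mean and the Veenker mean
coef(fit)
```

In this toy fit the Old Town coefficient equals the Old Town mean minus the Veenker mean, which is how the neighborhood rows in the table above should be read.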

The interpretable linear regression performs moderately well. Its RMSE from 10-fold cross-validation on the training data is $30,340. If the errors are roughly normal, predictions fall within about two RMSEs, or roughly $60,000, of the actual price 95% of the time. Given that the mean sale price for a house in Ames during the time period covered by our dataset is about $180,000, the predicted price is within about 33% of the mean price 95% of the time. This is not great, but it is a good starting point.
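
The arithmetic behind these figures can be sketched as follows. The two inputs are the values quoted in the paragraph above, and the factor of two is a rule of thumb that assumes roughly normal errors.

```r
rmse       <- 30340    # 10-fold cross-validation RMSE quoted above
mean_price <- 180000   # approximate mean sale price quoted above

# Roughly 95% of errors fall within two RMSEs if errors are about normal
half_width <- 2 * rmse

# Express that band relative to the mean sale price
rel_error <- half_width / mean_price

half_width
round(rel_error, 2)
```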

The benefit of this type of model is its interpretability. To demonstrate this, we will interpret one numerical coefficient and one categorical coefficient.

  • Holding all other variables constant, a one hundred square foot increase in gross living area is associated with a $6,197 increase in sale price (p < 0.001 from linear regression). Based on our model, we can be 95% confident that the true increase in sale price is between about $5,440 and $6,960 for a one hundred square foot increase in gross living area.

  • Holding all other variables constant, being located in the Old Town neighborhood is associated with a $36,657 decrease in sale price compared to a house in the Veenker neighborhood (p < 0.001 from linear regression). Based on our model, we can be 95% confident that the true decrease in sale price is between about $15,880 and $57,440 for a house in the Old Town neighborhood compared to a house in the Veenker neighborhood.
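
As a rough check on how such an interpretation is derived, the sketch below rescales the gr_liv_area estimate and standard error from the coefficient table to a one hundred square foot change. It uses a normal approximation to the t quantile, which is an assumption; `confint()` on the fitted model uses the exact t quantile and may differ by a few dollars.

```r
# gr_liv_area row of the coefficient table (dollars per square foot)
estimate  <- 61.9741
std_error <- 3.8718

# Normal approximation to the 97.5% t quantile (assumed adequate for a large sample)
z <- qnorm(0.975)

# Rescale the point estimate and the interval to a 100 square foot increase
per_100_sqft <- 100 * estimate
ci_100_sqft  <- 100 * (estimate + c(-1, 1) * z * std_error)

round(per_100_sqft)
round(ci_100_sqft, -1)   # rounded to the nearest $10
```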

Because we are using a linear regression model, we must check the assumptions of the model:

In [47]:
plot(lmFit$finalModel)
[Figures: the four default diagnostic plots from plot() on an lm object: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage]

The residuals for this model show some evidence of non-linearity and non-constant variance (heteroscedasticity). There is no evidence of non-normality, and there are no influential points that need to be addressed. We will address the issues in the next section when including transformations in our model.
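
The visual checks can be complemented with a few numeric summaries. The following is a minimal sketch on a stand-in model: the built-in `mtcars` data replaces the Ames training data, and the 4/n Cook's distance cutoff is a common rule of thumb rather than anything used in this notebook.

```r
# Stand-in model for illustration only; in the notebook this would be lmFit$finalModel
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- resid(fit)

# Normality: Shapiro-Wilk test on the residuals
normality <- shapiro.test(res)

# Influence: Cook's distance, flagged with the 4/n rule of thumb
cooks       <- cooks.distance(fit)
influential <- which(cooks > 4 / nrow(mtcars))

# Non-constant variance: informal auxiliary regression of squared
# residuals on fitted values (a clear slope suggests heteroscedasticity)
aux <- summary(lm(I(res^2) ~ fitted(fit)))

normality$p.value
influential
aux$coefficients
```

A small Shapiro-Wilk p-value or a clearly non-zero slope in the auxiliary regression would point to the same issues the plots show graphically.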

Objective 2: Predictive Model

To fit a linear model with more complexity, we included the transformations discussed in the EDA. These include taking the log of sale price, gross living area, and the other area measurements. Log transformations make the coefficients harder to interpret, but they should give better predictions based on the relationships shown below.
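
A minimal sketch of how such log columns might be created is shown here. The data values are hypothetical, the column names follow those used in this notebook, and the use of plain `log()` is an assumption. Note that an area of zero (for example, a house with no basement) becomes `-Inf`, which plotting functions treat as non-finite.

```r
# Toy data frame with hypothetical values
toy <- data.frame(
  sale_price    = c(208500, 181500, 223500),
  gr_liv_area   = c(1710, 1262, 1786),
  total_bsmt_sf = c(856, 0, 920)   # 0 square feet: no basement
)

# Natural-log transforms of the response and the area variables
toy$log_sale_price    <- log(toy$sale_price)
toy$log_gr_liv_area   <- log(toy$gr_liv_area)
toy$log_total_bsmt_sf <- log(toy$total_bsmt_sf)   # log(0) is -Inf

# Rows where the transformed basement area is usable
is.finite(toy$log_total_bsmt_sf)
```

If zero areas are common, `log1p()` (the log of one plus the value) is one alternative that keeps every row finite, at the cost of a slightly different interpretation.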

EDA for transformed continuous variables:

In [48]:
# ames_non_dummy <- ames[sapply(ames, calculate_range) != 1]
train %>%
  select(log_sale_price, log_gr_liv_area, log_lot_area, overall_qual_2, overall_cond,
  year_built, year_remod_add, log_total_bsmt_sf, log_garage_area, bedroom_abv_gr, log_x1st_flr_sf) %>%
  ggpairs(lower=list(continuous=lowerFn))
Warning messages (each repeated many times in the output; condensed here):
  Removed 1 rows containing missing values (`geom_text()`).
  Removed 37, 81, or 111 rows containing non-finite values (`stat_smooth()`).
  Removed 37 or 81 rows containing non-finite values (`stat_density()`).
  In simpleLoess() and predLoess(): pseudoinverse used at 5, 36, or 2; neighborhood radius 1 or 13; reciprocal condition number 0, 3.6457e-15, or 5.4002e-15; There are other near singularities as well. 1
`geom_smooth()` using formula = 'y ~ x'
Warning message in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
β€œThere are other near singularities as well. 1”
No description has been provided for this image

EDA For interactions?: Running interaction model in the background. Will include if it finishes in time.

Complex LR with feature selection:

In [49]:
# Define variables to be used and create formula
predictor_vars <- c(
  "log_gr_liv_area", "log_lot_area", "overall_qual_2", "year_built", "year_remod_add",
  "log_total_bsmt_sf", "log_garage_area", "lot_config_fr2", "house_style1story", "exter_qual_gd",
  "fireplace_qu_none", "sale_type_con"#, ". -sale_price" #Can include "." to make really complex
) %>% paste(collapse = " + ")
neighborhood_vars <- grep("neighborhood", colnames(train), value = TRUE) %>% paste(collapse = " + ")
terms <- (paste(predictor_vars, neighborhood_vars, sep = " + "))
formula <- as.formula(paste("log_sale_price ~", terms, "- neighborhood_veenker"))

# Complex LR with an elastic-net penalty (glmnet), tuned by 5-fold CV

# Check if the model object exists, train if it doesn't
if (file.exists("Models/lm_complex.rds")) {
  # Load the model object from disk
  lmComp <- readRDS("Models/lm_complex.rds")
} else {
  # Set up a parallel backend with the number of cores you want to use
  cores <- 8 # Change this to the number of cores you want to use
  cl <- makePSOCKcluster(cores)
  registerDoParallel(cl)

  set.seed(137)
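  # caret tunes alpha and lambda for glmnet by cross-validation.
  # The stepwise-only arguments `direction` and `penter` were dropped
  # because glmnet does not use them.
  lmComp <- train(formula,
    data = train,
    method = "glmnet",
    trControl = trainControl(method = "cv", number = 5, allowParallel = TRUE)
  )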
  
  # Stop the parallel backend
  stopCluster(cl)
  
  # Save the model object to disk
  saveRDS(lmComp, "Models/lm_complex.rds")
}

defaultSummary(data.frame(pred = predict(lmComp), obs = train$log_sale_price))
varImp(lmComp$finalModel) %>%
  filter(Overall > 0) %>%
  arrange(desc(Overall))

# Glmnet Regression model summary
lmComp
plot(lmComp)
opt.pen<-lmComp$finalModel$lambdaOpt #penalty term
coef(lmComp$finalModel,opt.pen)

# Output the predictions for the test set to a csv file
# Select only these variables from the testing dataset
# Get the names of the variables used in the model
var_names <- lmComp$finalModel$xNames
new_test <- test[, c(var_names, "neighborhood_veenker")]
id_col <- test$id
stepwise_pred <- predict(lmComp, newdata = as.matrix(new_test))

# Save predictions
data.frame(id = id_col, SalePrice = exp(stepwise_pred)) %>%
  dplyr::select(id, SalePrice) %>%
  write_csv("Predictions/complexlm_predictions.csv")

# stepwise_pred %>%
#   data.frame() %>%
#   rownames_to_column(var = "id") %>%
#   mutate(SalePrice = exp(stepwise_pred)) %>%
#   dplyr::select(id, SalePrice) %>%
#   write_csv("Predictions/complexlm_predictions.csv")
RMSE      0.141264697972528
Rsquared  0.875014600410696
MAE       0.10386416888613

A data.frame: 34 × 1
                            Overall
                              <dbl>
log_gr_liv_area         0.404857597
neighborhood_idotrr     0.205928490
neighborhood_gilbert    0.156110544
log_lot_area            0.139474710
neighborhood_edwards    0.131285722
neighborhood_br_dale    0.110631767
neighborhood_old_town   0.105102551
neighborhood_sawyer_w   0.104842305
neighborhood_crawfor    0.096610610
sale_type_con           0.095257832
neighborhood_mitchel    0.092102541
neighborhood_nw_ames    0.090806304
neighborhood_meadow_v   0.090050461
neighborhood_sawyer     0.076540423
neighborhood_collg_cr   0.072704174
neighborhood_timber     0.069457450
neighborhood_stone_br   0.069004205
neighborhood_blmngtn    0.052052229
fireplace_qu_none       0.051729657
lot_config_fr2          0.048520536
neighborhood_swisu      0.046572320
neighborhood_n_ames     0.043194588
neighborhood_somerst    0.042896560
neighborhood_no_ridge   0.032071842
house_style1story       0.029014767
neighborhood_brk_side   0.022692580
exter_qual_gd           0.018449066
neighborhood_blueste    0.011140420
overall_qual_2          0.007509962
neighborhood_clear_cr   0.006013703
neighborhood_nridg_ht   0.005576222
neighborhood_n_pk_vill  0.004286030
year_built              0.002918634
year_remod_add          0.002635765

glmnet 

1458 samples
  37 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1166, 1166, 1167, 1166, 1167 
Resampling results across tuning parameters:

  alpha  lambda        RMSE       Rsquared   MAE      
  0.10   0.0006543964  0.1457942  0.8677977  0.1072713
  0.10   0.0065439640  0.1459272  0.8675386  0.1071822
  0.10   0.0654396404  0.1498018  0.8651849  0.1085737
  0.55   0.0006543964  0.1458369  0.8676922  0.1072402
  0.55   0.0065439640  0.1461345  0.8672977  0.1072563
  0.55   0.0654396404  0.1665586  0.8475884  0.1189285
  1.00   0.0006543964  0.1458812  0.8676002  0.1072906
  1.00   0.0065439640  0.1471834  0.8656324  0.1081394
  1.00   0.0654396404  0.1856431  0.8263513  0.1338341

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0.1 and lambda = 0.0006543964.
37 x 1 sparse Matrix of class "dgCMatrix"
                                 s1
(Intercept)            -3.384558508
log_gr_liv_area         0.404857597
log_lot_area            0.139474710
overall_qual_2          0.007509962
year_built              0.002918634
year_remod_add          0.002635765
log_total_bsmt_sf       .          
log_garage_area         .          
lot_config_fr2         -0.048520536
house_style1story       0.029014767
exter_qual_gd          -0.018449066
fireplace_qu_none      -0.051729657
sale_type_con           0.095257832
neighborhood_blmngtn   -0.052052229
neighborhood_blueste   -0.011140420
neighborhood_br_dale   -0.110631767
neighborhood_brk_side  -0.022692580
neighborhood_clear_cr  -0.006013703
neighborhood_collg_cr  -0.072704174
neighborhood_crawfor    0.096610610
neighborhood_edwards   -0.131285722
neighborhood_gilbert   -0.156110544
neighborhood_idotrr    -0.205928490
neighborhood_meadow_v  -0.090050461
neighborhood_mitchel   -0.092102541
neighborhood_n_ames    -0.043194588
neighborhood_no_ridge   0.032071842
neighborhood_n_pk_vill  0.004286030
neighborhood_nridg_ht   0.005576222
neighborhood_nw_ames   -0.090806304
neighborhood_old_town  -0.105102551
neighborhood_sawyer    -0.076540423
neighborhood_sawyer_w  -0.104842305
neighborhood_somerst   -0.042896560
neighborhood_stone_br   0.069004205
neighborhood_swisu     -0.046572320
neighborhood_timber    -0.069457450
No description has been provided for this image

The linear regression model with transformations does better than the interpretable model. On the log scale, the RMSE is about 0.141 on the training data and 0.146 under 5-fold cross-validation, which translates to a typical multiplicative error of roughly 15 to 16% on the original scale. Because the response is log-transformed, this is hard to state exactly in dollars, but it corresponds to approximately a $27,000 error in the predicted sale price. This is a clear improvement over the $30,000 error from the interpretable model.
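As a rough check on the percentage figure, the log-scale RMSE values printed in the caret output above can be back-transformed by hand. This is only a sketch: the variable names are ours, and treating `exp(RMSE) - 1` as a "typical percentage error" is an approximation, not an exact conversion.

```r
# Values copied from the caret output above
rmse_train <- 0.1412647  # RMSE on the training data, log(sale price) scale
rmse_cv    <- 0.1457942  # 5-fold CV RMSE at the selected alpha and lambda

# A log-scale error of size r corresponds to a multiplicative factor exp(r),
# so exp(r) - 1 is the approximate relative error on the dollar scale
pct_error <- (exp(c(train = rmse_train, cv = rmse_cv)) - 1) * 100
round(pct_error, 1)
```

Both values land between 15% and 16%, which is where the percentage quoted for this model comes from.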

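From the coefficients, we can see that the penalized regression retained most of the predictors from the previous model. Total basement area and garage area were excluded (their coefficients were shrunk to exactly zero, shown as "." in the table above), while every neighborhood indicator, including blmngtn and blueste, kept a nonzero coefficient.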
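For reference, the elastic-net objective that glmnet minimizes for a Gaussian response can be written as follows. The notation is ours rather than the notebook's: $N$ is the number of training rows, $y_i$ is the log sale price, $x_i$ is the predictor vector, and $\alpha$ and $\lambda$ are the two tuning parameters reported by caret.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\left(y_i-\beta_0-x_i^{\top}\beta\right)^2
+\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2+\alpha\,\lVert\beta\rVert_1\right]
```

With the selected values $\alpha = 0.1$ and $\lambda \approx 0.00065$, the penalty is mostly ridge-like. The small $\ell_1$ component is what can set individual coefficients exactly to zero, as happened for log_total_bsmt_sf and log_garage_area.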

Because we are still using a linear regression model, we must check the assumptions of the model:

In [50]:
# Plot the residuals of lmComp

# Choose a lambda value
lambda <- lmComp$bestTune$lambda

# Get predictions for this lambda
predictions <- predict(lmComp$finalModel, newx = as.matrix(train[, c(var_names)]), s = lambda)

# Calculate residuals
residuals <- train$log_sale_price - predictions

# Plot residuals
ggplot() +
  geom_point(aes(x = predictions, y = residuals)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(x = "Fitted values", y = "Residuals", title = "Residuals vs Fitted values")

## Q-Q plot
qqnorm(residuals)
No description has been provided for this image
No description has been provided for this image

The residuals for this model show no evidence of non-linearity or non-constant variance (heteroscedasticity), and the Q-Q plot shows no evidence of non-normality. There are two influential points that could be addressed if there were more time. Because of the transformations, the more complex linear model meets the assumptions of linear regression better than the interpretable model does.
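One way the unusual points could be located is to flag residuals that sit far from the bulk of the distribution. The sketch below uses simulated residuals with two planted outliers as a stand-in for the notebook's `residuals` vector. The cutoff of 4 robust standard deviations is an arbitrary assumption, and a large residual is only a candidate for an influential point, not proof of one.

```r
# Hypothetical stand-in data: in the notebook this would be the
# `residuals` vector computed from lmComp above
set.seed(42)
resid_vals <- rnorm(200, mean = 0, sd = 0.14)
resid_vals[c(17, 88)] <- c(-0.90, -0.75)  # two planted outliers

# Robust z-scores: center on the median and scale by the MAD so that
# the outliers themselves do not inflate the scale estimate
z <- (resid_vals - median(resid_vals)) / mad(resid_vals)

# Flag observations more than 4 robust SDs from the center (assumed cutoff)
flagged <- which(abs(z) > 4)
flagged
```

The flagged row indices can then be inspected in the training data to decide whether the observations are data errors or genuinely unusual sales.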

In [51]:
# Non-Parametric model
library(randomForest)
library(ggplot2)
set.seed(1234)

# randomForest's formula interface builds its predictors from the variables
# named in the formula and does not create interaction columns, so the terms
# are joined with "+" rather than "*".
predictor_vars <- c(
  "log_gr_liv_area", "log_lot_area", "overall_qual_2", "year_built", "year_remod_add",
  "log_total_bsmt_sf", "log_garage_area", "lot_config_fr2", "house_style1story", "exter_qual_gd",
  "fireplace_qu_none", "sale_type_con"#, ". -sale_price" #Can include "." to make really complex
) %>% paste(collapse = " + ")
neighborhood_vars <- grep("neighborhood", colnames(train), value = TRUE) %>% paste(collapse = " + ")
terms <- paste(predictor_vars, neighborhood_vars, sep = " + ")
formula <- as.formula(paste("log_sale_price ~", terms, "- neighborhood_veenker"))
library(dplyr)

#removing rows with infinity in one of the values

df <- train[!is.infinite(rowSums(train)),]

rf.fit <- randomForest(formula, data = df,ntree=500)


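# Prediction on the test set: drop columns with NA, then rows with infinite values
df_test <- test[, !is.na(colSums(test))]
df_test <- df_test[!is.infinite(rowSums(df_test)), ]

# The forest predicts log_sale_price, so back-transform to dollars
df_test$sale_price <- exp(predict(rf.fit, newdata = df_test))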

print(rf.fit)

plot(rf.fit)

## Visualize variable importance ----------------------------------------------

# Get variable importance from the model fit
ImpData <- as.data.frame(importance(rf.fit))
ImpData$Var.Names <- row.names(ImpData)

ggplot(ImpData, aes(x=Var.Names, y=`IncNodePurity`)) +
  geom_segment( aes(x=Var.Names, xend=Var.Names, y=0, yend=`IncNodePurity`), color="skyblue") +
  geom_point(aes(size = IncNodePurity), color="blue", alpha=0.6) +
  theme_light() +
  coord_flip() +
  theme(
    legend.position="bottom",
    panel.grid.major.y = element_blank(),
    panel.border = element_blank(),
    axis.ticks.y = element_blank()
  )
randomForest 4.7-1.1
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:psych’:

    outlier

The following object is masked from ‘package:dplyr’:

    combine

The following object is masked from ‘package:ggplot2’:

    margin


Call:
 randomForest(formula = formula, data = df, ntree = 500) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 12

          Mean of squared residuals: 0.01842256
                    % Var explained: 87.03
No description has been provided for this image
No description has been provided for this image

The more complex linear regression model improved on the interpretable linear regression and the random forest further improved on that. The increase in predictive power, however, comes at the cost of interpretability of the model. The complex linear regression could be interpreted with some effort, but the random forest is closer to a black box that can only be used for prediction.
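To put the two predictive models on the same scale, the random forest's out-of-bag mean squared residual can be converted to an RMSE and set beside the cross-validated RMSE from caret. This is a sketch using values copied from the outputs above, with variable names of our own choosing. Out-of-bag error and 5-fold CV are different resampling schemes, and the forest was fit after dropping rows with infinite values, so the comparison is approximate.

```r
# Values copied from the model output above
rf_oob_mse     <- 0.01842256  # randomForest "Mean of squared residuals" (out-of-bag)
glmnet_cv_rmse <- 0.1457942   # caret 5-fold CV RMSE at alpha = 0.1

# Convert the forest's MSE to an RMSE on the log(sale price) scale
rf_oob_rmse <- sqrt(rf_oob_mse)
round(c(glmnet_cv = glmnet_cv_rmse, rf_oob = rf_oob_rmse), 4)
```

The forest's out-of-bag RMSE comes out near 0.136, a little below the penalized regression's 0.146.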

Conclusion

The first objective was to build a linear regression to explain some of the variation in Sale Price of homes in the Ames, IA dataset. We showed that it is possible to build a linear regression that is useful for interpreting the effects of variables of interest, but that this type of model was not the best choice for predictive accuracy. We recommend this type of model for a person who is interested in understanding the effects of variables on housing sale prices, but not necessarily for predicting the sale price of a home. For example, this model could be very useful for a developer who is deciding what features to include in a new development.

The second objective was to build a model that would be useful for predicting the sale price of homes in the Ames, IA dataset. We showed that a random forest model was the best choice for predictive accuracy, but that this model was not useful for interpreting the effects of variables of interest. A more complex linear regression model provides a compromise between the two ends of the spectrum. Which model to use depends on the needs of the user. We recommend the random forest model for a person who is interested in predicting the sale price of a home without needing to understand the effects of variables. For example, this could be very useful for a real estate agent who is trying to price a home for sale.

The scope of inference for this work is limited to house prices in Ames, IA during the time period in which this data was collected. Housing markets are very localized and can change drastically with time. Nonetheless, we believe the models to be generalizable provided they are trained on data from the target population. Because this is an observational study, no causation can be implied. With more time or computing power, the authors believe there is room to fit even more complex models, for example a complex linear regression that includes interaction terms in order to capture non-linear effects of variables, or a random forest model with more trees and more variables.